FOREWORD


We appreciate the opportunity which the author has 
graciously given to the Actuarial Society of America to publish 
this book. Representing the fruits of his own intensive work 
in the field of Mathematical Statistics as an actuary, sup- 
plemented by his experience in lecturing to actuarial students, 
this volume should be of great assistance to students who are 
working on basic elements of theory and to actuaries who wish 
to acquire greater mastery in more advanced aspects of the 
subject. We believe that it will also be consulted by those 
who work in statistical fields outside the actuarial profession. 

The Actuarial Society of America is making this publication 
available in pursuance of its policy to aid in the educational 
facilities for entrance into the actuarial profession and in the 
proficient pursuit of actuarial science. We acknowledge with 
deep gratefulness the author’s devotion to his profession in 
contributing his own time and efforts to these same ends. 

J. M. Laird, 

President. 

R. D. Murphy, 

Chairman of the Committee 
on Actuarial Studies. 


THE ACTUARIAL SOCIETY OF AMERICA




“Look to the essence of a thing, whether it be a point
of doctrine, of practice, or of interpretation.”

Marcus Aurelius.


PREFACE


It may possibly at first seem strange that yet another book 
should be prepared on Mathematical Statistics, when so many 
admirable text-books on various phases of the subject have been 
published within recent years. 

Actuaries, however, and others, such as vital statisticians, 
who in their work find themselves concerned, sooner or later, with 
the foundations of the theory of probability as well as with its 
application to advanced and special problems concerning life 
contingencies and allied questions, have never yet been able to 
avail themselves of any single ordered and comprehensive treat- 
ment which would give them precisely those portions of the 
theory and practice which they require. The field is so vast, the 
literature so scattered and extensive, and the viewpoints from 
which the subject has been presented are so varied, that no present 
publication has seemed to the writer — in a long experience of 
lecturing to actuarial students — to fill the somewhat highly spe- 
cialized requirements of the actuary. 

These considerations form the sole reason — indeed the only 
justification — for this book. In it an attempt has been made to 
assemble and co-ordinate those portions of the theory and appli- 
cations of Mathematical Statistics which are really needed, both 
by actuarial beginners in their studies, and by qualified actuaries 
in the solution of the problems which arise in practice. The 
selection of material, and the treatment, have thus been devel- 
oped along special lines, with a particular objective. In some 
portions the phraseology adopted has been deliberately repeti- 
tious — for the main objective has been to clarify the mathe- 
matical foundations of the subject, without, however, bewilder- 
ing the ordinary student by an intricate maze of highly condensed 
symbolism. 

One other viewpoint should perhaps also be explained. In 
the teaching of these matters there is, inevitably, a cultural
responsibility, as well as a merely pedagogic duty; only by an
adequate presentation of the former can the latter be fully and 
satisfactorily achieved. It may well be desirable, amongst all 
the facets and complexities of modern scholarship, so to present 
the practical achievements in any field of learning that he who, 
of necessity, must run at the tempo of modern life may read with 
conservation of his energy; but this accomplishment must leave 
much of the cultural background unassimilated unless some reas- 
onable attempt is made to picture the historical development. A 
recital merely of the present state of knowledge, moreover, incurs 
the danger — seen more than once in the scientific world — that 
some investigator in the future may re-discover work in ignorance 
of similar research undertaken years before. I therefore believe 
it essential to the proper understanding of any subject to absorb 
the history of the mental processes which have guided its develop- 
ment. This study, accordingly, is framed on that conviction. It 
is hoped, however, that the arrangement used will enable the 
reader to acquire the background easily, and in a manner less 
destructive of imaginative interest than is so often inseparable 
from the teaching of history per se. 

In one major respect that arrangement is, I believe, novel. 
An endeavour has been made to reduce the distractions which 
inevitably arise when the many essential explanatory discussions 
and extended mathematical analyses are inserted in the main 
text. The body of the treatment has therefore been designed as 
a condensed presentation only, from which the principal ideas 
may be acquired in an orderly and easy manner — the subsidiary 
questions which naturally arise, and which must, of course, be 
answered, being dealt with by reference to separate portions of 
the book. By this means it is hoped that the student approach- 
ing the subject for the first time, or the graduate who may desire 
to refresh his views, will be able early in his reading to obtain a 
comprehensive picture of the whole, while yet possessing ample 
opportunities to elaborate the background or the current details 
as and when he may desire. 

The preparation of this book has been in my thoughts for 
many years. On the outbreak of war in 1939 the manuscript was
already nearly finished, and its early publication was intended.
The dislocations of recent months, however, have caused unavoid- 
able delay; but its completion was encouraged by the desires of 
many students, and by the interest of professional friends. I 
therefore hope that publication now may be justified, despite the 
war, by any assistance the book may give to those who, in the 
years to come, will carry the burdens of dealing, statistically or 
actuarially, with the problems either of a war economy or of a 
saner and more peaceful world. 

The final stages before publication have been facilitated 
notably by correspondence and conversations with Dr. W. 
Edwards Deming on certain points in the history of statistical 
theory, and the “Student” and tests. Mr. Donald D. Cody
has generously devoted many hours to a critical reading of almost 
the whole manuscript, and has contributed valuable suggestions 
resulting in clarifications and some enlargement of the text. The 
diagrams (except Figure 20, which was evolved by Dr. Deming 
during a discussion of the “Student” theory) have been prepared 
by Mr. John B. McKinnon, who also has assisted greatly in the 
arrangement of the bibliography. Mr. Ray D. Murphy has 
given most encouraging co-operation as Chairman of the Com- 
mittee on Actuarial Studies of the Actuarial Society of America. 
It is a real pleasure to record my thanks to these four friends. 

Hugh H. Wolfenden 

Toronto. 

February, 1942. 



I. INTRODUCTION 


Certain aspects of the study of Mathematical Statistics have 
always seemed to present special difficulties to students. This 
is particularly true in those portions which involve the classical 
Theory of Errors and the Method of Least Squares, and their 
relation to some of the more advanced developments of the modern 
Theory of Sampling, systems of Frequency Curves, and the 
Method of Moments. 

In the case of actuarial students these difficulties have been 
due only slightly to any lack of basic training in the earlier mathe- 
matical requirements — for they come to the subject with a sound 
practical knowledge of the elements of the Theory of Prob- 
ability, and with an adequate facility in most of the necessary 
fundamentals of the differential and integral calculus. They are 
attributable rather to a hiatus which exists in the usual courses 
of study. Notwithstanding some improvement within recent 
years, students are still generally plunged almost directly from 
those simple elements of probability and calculus, as they are 
taught in the text-books, into all the complex ramifications of 
Mathematical Statistics, without any really sufficient prepara- 
tion in the underlying theories. It is consequently little wonder, 
when there, almost for the first time, they meet constant refer- 
ences to an enormous mass of historical literature, and encounter 
the classical disagreements and still prevailing conflicts engen- 
dered by metaphysical speculations on the nature of probability 
and the theories flowing from it, that even the best students 
sometimes feel bewildered, and have sought a presentation of the 
subject which would lead by easy stages through the apparent 
maze. 

It is the purpose of this volume to attempt that task. Only 
the elements of probability and calculus will be assumed as 
known. From that point the development will follow largely
the classical discussions, with emphasis upon the essential place of
the Theory of Errors and Least Squares. It has always seemed
to me impossible to approach the modern theories with any hope 
of success if we leap over all those concepts — “errors”, the “nor- 
mal curve”, “mean square of error”, “probable error”, “weights”, 
“normal equations”, and so forth — which impelled so many of 
the controversies and yet have formed the stepping-stones of 
history towards more recent methods. The ultimate result of 
neglecting any of these fundamental matters can only be con- 
fusion of the student's mind. 

In thus treating the subject from its elements and through 
its classical discussions an endeavour will be made, however, to 
free the body of the text, as much as possible, from historical 
descriptions, elaborate demonstrations, and instances of appli- 
cation. These three types of information will therefore be found 
in three appended sections. Their relegation in that manner is 
not, of course, intended to imply their unimportance — for, as a 
principle of pedagogy, nothing is more essential in the presenta- 
tion of any subject than a picture of the background (given in 
Section A on History), an understanding of the technical analysis 
(provided by Section B on Mathematics and Interpretations), 
and an appreciation of the special practical utilities and applica- 
tions (shown in Section C on Applications). It is therefore hoped 
that references to these portions of the study will not be dis- 
regarded. They will be given in brackets, in the form (p. 163; 
A; 7), for example, meaning that the supplementary material will 
be found at p. 163 of this volume, in Section A, Part 7, in that 
case. 

Historically important and currently useful publications will 
be brought together in a Bibliography, and identified by italic 
numerals in two lists, with page reference where possible, and the 
letter H or P to indicate “historical” or “present” value — as 
H:53:329, meaning the “H” list of works of historical
significance, publication number 53, p. 329 thereof. An attempt
has been made to give complete (though not redundant)
documentation, for two reasons: firstly, because it seems
desirable, in matters so inseparably concerned with the logical
dilemmas of philosophical probability, to refer wherever possible
to those publications which may properly be considered as
authoritative in
their respective fields; and secondly, in order that the student, if 
he should feel dissatisfied at any stage with the development or 
views set out herein, may have readily available the sources of 
original deductions and supplementary discussions on each topic. 



II. THE NATURE OF THE PROBLEMS


The actuarial student is ordinarily introduced to the elementary 
notion of Probability in his text-books on algebra. Following 
instruction in permutations and combinations, he is led to a 
“definition” of the concept, is taken through the “addition” and
“multiplication” rules, and is quietly permitted to assimilate — 
in his own fashion, and generally without question at that stage — 
the ideas of “mutually exclusive”, “independent”, “dependent”, 
and “equally likely” events. He is then exercised actively in the 
combination and unravelling of the probabilities of complex sets 
of occurrences, in preparation for the further drilling which he 
must receive in his manipulation of such probabilities in dealing 
with life contingencies. 

It is well that, in this curriculum, he is not brought deliber- 
ately into touch with those subtleties of thought and metaphysical 
speculations which he now must meet, and in some way resolve, 
in attempting to apply his text-book training in the wider field 
of Mathematical Statistics. For in this wider field he will en- 
counter early certain philosophic doubts as to the precision of 
the text-book “definitions”; he will wonder whether the hitherto 
accepted axioms may not perhaps, from another point of view, 
be theorems; he will discover intriguingly conflicting attempts to 
interpret “equally likely” events; and he will find, all through 
the history, and still today, disputes concerning the validity of 
“inverse” (“inductive” or “a posteriori”) probability (the
“probability of causes” in the past), much argument expended on
its relationship to “direct” (“deductive” or “a priori”)
probability (the probability of a specified event in the future),
and disagreement yet about those “paradoxes” to which so much
analysis has been devoted notwithstanding often a fundamental
inability to settle precisely the nature of the problems.*

It is accordingly in an atmosphere of controversy that we 
must make a start. In order to avoid the danger of confusing 
the student at this stage, we may omit from the body of this 
treatment any discussion of the extremely interesting specula- 
tions which may be formulated with regard to the nature of 
probability and, consequently, the best method of approaching 
it. The nature of probability is in many respects so elusive a 
conception that it must necessarily be an arbitrary effort to 
defend a “best” approach, except within certain stated limitations 
(p. 179; B; 1). We shall, however, select, with reason, one
approach, namely, that which leads as directly as possible to the
concept of statistical frequency. It will be found, upon examina- 
tion, to be adequate for treating the essentially practical aspects 
of probability with which actuaries and similar investigators are 
constantly brought into touch, while it avoids many—though, as 
will be seen, not all — of the logical dilemmas and practical diffi- 
culties of the other methods (p. 183; B; 1). 

*These questions concerning “inverse” probability are deliberately,
although reluctantly, excluded from this study. They are not
essential for the purposes immediately in view, although the
student will discover ultimately that he cannot escape from their
due consideration if he is to grasp fully the meaning and
implications of the whole subject.

This concept of statistical frequency may be approached, in the
first place, by considering the essential meaning of James
Bernoulli’s Limit Theorem (H:^), the “Law of Large Numbers”,
which, in broad terms, may be stated thus: “If p be the
true probability of the happening of a certain event in a single
trial, n a number of trials, and s the number of times the event
is observed to happen in those n trials, then, as n increases, the
probability approaches certainty that the statistical frequency,
$\frac{s}{n}$, will approach p” (see p. 187; B; 2). This statement
may evidently be interpreted as meaning that
$\lim_{n \to \infty} \frac{s}{n} = p$. The question which therefore
immediately presents itself concerns the situation which arises
when the number, n, of trials is not infinitely great. We are
thus brought at once to the necessity of investigating the
deviations which are likely to occur when the number, n, of
trials is limited (see also p. 263; C; 1).
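
The approach of the statistical frequency towards p, and the deviations which remain when the number of trials is limited, may be illustrated by a small simulation. The following is only a minimal sketch in Python and forms no part of the original treatment; the function name `observed_frequency`, the value p = 0.3, and the use of a seeded pseudo-random generator in place of actual trials are all assumptions made for illustration.

```python
import random


def observed_frequency(p, n, seed=0):
    """Simulate n independent trials with constant probability p; return s/n."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    s = sum(1 for _ in range(n) if rng.random() < p)  # s = number of happenings
    return s / n


if __name__ == "__main__":
    p = 0.3  # assumed "true" probability, chosen only for the demonstration
    for n in (10, 100, 1_000, 10_000, 100_000):
        frequency = observed_frequency(p, n)
        print(f"n = {n:>6}   s/n = {frequency:.4f}   deviation = {frequency - p:+.4f}")
```

The deviations printed for the smaller values of n are precisely the quantities whose likely size the succeeding chapters set out to measure.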

In its classical form — circumscribed by certain specific and 
hampering assumptions — the theory of deviations so approached 
became known as the Theory of Errors, in which the Law of 
Large Numbers, the symmetrical Normal Curve of Error, and 
the “adjustment” of observations by the consequent Method of
Least Squares represent the important subdivisions. In its more 
recent developments — released from the necessities of those ear- 
lier assumptions — it leads directly also to the Lexis Theory, the 
Theory of Random Sampling, Poisson’s Law of Small Numbers, 
the general representation of “probability distributions” or
“frequency distributions” by both symmetrical and unsymmetrical
functions as in the Pearsonian Frequency Curves and Generalized 
Normal Curves (with the Method of Moments for fitting such 
curves), and to the Test for Goodness of Fit. 



III. THE CLASSICAL APPROACH 


The important distinction stated in the preceding chapter
between the true a priori probability, p, and the statistical
frequency, say, $\frac{s}{n}$, resulting from s actual observations
of the success (i.e., the happening) of an event in, let us
suppose, n trials, where n is not infinitely great, may be
crystallized in practical form (see also p. 187; B; 2) by the
following question:

What is the probability, in a finite number of trials, n, in each
of which the true probability of the event happening is a constant
p (and the true probability of its not happening, or 1 − p, is q),
that the event will actually happen np+x times, i.e., will show
a deviation of +x from the np occurrences which would be
expected (according to the Law of Large Numbers) if n were
infinitely great? (See p. 264; C; 2.)

This probability is immediately expressible according to the
usual rules of probability for independent trials with constant
probability p. For, since the trials are independent, the
probability of the event happening exactly np+x times in any given
order is $p^{np+x}$, and of its failing the other $n-(np+x)$, or
$nq-x$, times, is $q^{nq-x}$; and since all the different sequences
of the np+x happenings (and the consequent nq−x failures) can
occur in ${}^{n}C_{np+x}$ different arrangements, the total
probability, $y_x$ say, is

$${}^{n}C_{np+x}\; p^{np+x}\, q^{nq-x}, \quad \text{where } p+q=1 \tag{1}$$

that is,

$$y_x = \frac{n!}{(np+x)!\,(nq-x)!}\; p^{np+x}\, q^{nq-x} \tag{2}$$
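
Formula (2) may be evaluated directly when n is moderate. The sketch below is an illustration only, written in Python; the function names `point_binomial` and `deviation_probability` are hypothetical and are not taken from the text, and ordinary floating-point arithmetic is assumed to be accurate enough for the purpose.

```python
from math import comb


def point_binomial(n, p, successes):
    """Probability of exactly `successes` happenings in n independent trials."""
    q = 1.0 - p
    return comb(n, successes) * p**successes * q ** (n - successes)


def deviation_probability(n, p, x):
    """y_x of formula (2): probability of np + x happenings.

    np + x must be a whole number between 0 and n, as the text requires.
    """
    target = n * p + x
    k = round(target)
    if abs(k - target) > 1e-9 or not 0 <= k <= n:
        raise ValueError("np + x must be an integer between 0 and n")
    return point_binomial(n, p, k)


if __name__ == "__main__":
    # Ten trials with p = 0.3: the expected number is 3, so x runs from -3 to +7.
    for x in range(-3, 8):
        print(f"x = {x:+d}   y_x = {deviation_probability(10, 0.3, x):.6f}")
```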



This expression, of course, is the general term of the series
obtained by setting out the probabilities of exactly 0, 1, 2, ... n
successes, i.e., by putting np+x = 0, 1, 2, ... n; that is to say, it
is the general term of the series

$$q^{n} + npq^{n-1} + \frac{n(n-1)}{1 \cdot 2}\, p^{2} q^{n-2} + \dots + p^{n} = (q+p)^{n} \tag{3}$$

which is often referred to as the Bernoullian Series or the Point
Binomial. It may also be useful at this stage to introduce into
the terminology the idea conveyed by its alternative description
as the Bernoullian or binomial Frequency Distribution (see p. 189;
B; 3), since the series in fact obviously represents the relative
frequencies (not the actual frequencies, which would be n times
the relative frequencies) with which the various possible numbers
of happenings are theoretically distributed over the range from
no occurrences to complete success when p remains constant
throughout (see p. 264; C; 2). It will be apparent that the
distribution is unsymmetrical when $p \neq q$, as in the example shown
in Figure 1, and that when $p = q\ (= \tfrac{1}{2})$ the shape becomes
symmetrical, as in Figure 2 (see also p. 189; B; 3).

From these expressions certain functions of great importance 
can now be deduced easily. It will be well to set out the demon- 
strations in extenso, since exactly the same principles will be 
required later in more complex circumstances. 

First of all, let us examine the average, or mean, expected
number of happenings in the n trials. From (2) and (3) it is
clear that the probability of 0 happenings is $q^{n}$, with
consequently an expected number $q^{n}(0)$; that the probability of
1 success is $npq^{n-1}$, with an expected number, therefore, of
$npq^{n-1}(1)$; and so on until finally we reach the last term
$p^{n}(n)$. The total expected number of happenings in the n trials
is therefore the sum of these terms, or

$$q^{n}(0) + npq^{n-1}(1) + \frac{n(n-1)}{1 \cdot 2}\, p^{2} q^{n-2}(2) + \dots + p^{n}(n) = np(q+p)^{n-1} = np \tag{4}$$

since $q+p=1$.

What now would be the average of the expected squares of the
number of happenings? By reasoning similar to that above it
can immediately be written down as

$$\begin{aligned}
& q^{n}(0)^{2} + npq^{n-1}(1)^{2} + \frac{n(n-1)}{1 \cdot 2}\, p^{2} q^{n-2}(2)^{2} + \dots + p^{n}(n)^{2} \\
&= np\left[q^{n-1} + (n-1)pq^{n-2}(2) + \dots + p^{n-1}(n)\right] \\
&= np\left[(q+p)^{n-1} + (n-1)p(q+p)^{n-2}\right] \\
&= np\left[1 + (n-1)p\right] = np(np+q)
\end{aligned} \tag{5}$$

The preceding formulae can, of course, be put very simply in
terms of summations, (4) being written
$\sum_{t=0}^{t=n} {}^{n}C_{t}\, q^{n-t} p^{t}\,(t)$, and (5) as
$\sum_{t=0}^{t=n} {}^{n}C_{t}\, q^{n-t} p^{t}\,(t)^{2}$. They are,
moreover, in fact the mean and the second moment (see p. 253;
B; 27) of the binomial distribution with reference to the
beginning of the range.
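
The two summations may be checked by carrying them out term by term. The following Python sketch is offered only as an illustration; the function name `raw_moment` and the values n = 12, p = 1/3 are assumptions, and exact rational arithmetic is used so that the comparison with (4) and (5) involves no rounding.

```python
from fractions import Fraction
from math import comb


def raw_moment(n, p, order):
    """Sum over t of C(n, t) q^(n-t) p^t t^order, taken from the start of the range."""
    p = Fraction(p)
    q = 1 - p
    return sum(
        comb(n, t) * q ** (n - t) * p**t * t**order for t in range(n + 1)
    )


if __name__ == "__main__":
    n, p = 12, Fraction(1, 3)  # illustrative values only
    q = 1 - p
    print("formula (4) holds:", raw_moment(n, p, 1) == n * p)
    print("formula (5) holds:", raw_moment(n, p, 2) == n * p * (n * p + q))
```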

Suppose now that, instead of (5), we investigate the expected
squares of the deviations measured from the mean np, as defined by
the formula

$$\sum_{t=0}^{t=n} {}^{n}C_{t}\, q^{n-t} p^{t}\,(t-np)^{2} \tag{6}$$

This is the second moment about the mean (see p. 253; B; 27),
and is known as the mean square deviation (see p. 162; A; 6).
In expanded form the expression is

$$q^{n}(n^{2}p^{2}) + nq^{n-1}p\,(1 - 2np + n^{2}p^{2}) + \dots + p^{n}(n^{2} - 2n^{2}p + n^{2}p^{2})$$

which, by the same methods as were used to establish (4) and (5),
reduces easily (for proof see p. 202; B; 4) to

$$npq \tag{7}$$

The standard deviation, $\sigma$, being defined as the square root
of the mean square deviation, is consequently

$$\sigma = \sqrt{npq} \tag{8}$$

It will thus be seen that, up to this point, some of the most 
obviously important problems, as they arise naturally from the 
simplest examination of the theory of probability, have been 
dealt with by the ordinary processes of algebra. It may be well 
to emphasize here that, in so doing, the student has been intro- 
duced to the following basically important notions: (a) The 
distinction and relation between a true probability, a priori, and 
the observed statistical frequency, a posteriori; (b) the most 
elementary type of theoretical frequency distribution for the 
number of successes in n trials, i.e., the discontinuous
“Bernoullian series” or “point binomial”; and (c) three functions
derived therefrom, namely, the “mean” np, the “mean square
deviation” npq, and the “standard deviation” $\sqrt{npq}$.
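
The mean square deviation and standard deviation just summarized can likewise be confirmed by direct summation about the mean. This is a minimal sketch in Python under the same assumptions as before (hypothetical function names, exact fractions for the summation); it is an illustration and not the demonstration referred to in the text.

```python
from fractions import Fraction
from math import comb, sqrt


def mean_square_deviation(n, p):
    """Second moment of the point binomial about its mean np, summed term by term."""
    p = Fraction(p)
    q = 1 - p
    mean = n * p
    return sum(
        comb(n, t) * q ** (n - t) * p**t * (t - mean) ** 2 for t in range(n + 1)
    )


def standard_deviation(n, p):
    """Square root of the mean square deviation."""
    return sqrt(float(mean_square_deviation(n, p)))


if __name__ == "__main__":
    n, p = 12, Fraction(1, 3)  # illustrative values only
    print("equals npq:", mean_square_deviation(n, p) == n * p * (1 - p))
    print("standard deviation:", standard_deviation(n, p))
```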


Now, as is shown at p. 264; C; 3, the preceding methods
clearly are quite simple, and do not encounter any difficulty even
if the number of trials, n, be made very large. There are,
however, other functions which can be derived from the point
binomial without undue effort if n is small, but which obviously
would entail prohibitive labour when n becomes large. Suppose,
for example, it were desired to find the probability of a deviation
between certain stated limits, which would necessitate the
summing of the corresponding terms of the point binomial; it would
be an easy enough matter when n is small, but impossibly
formidable if, as very often happens, n is large. It consequently
became essential to seek a method which would bring the
processes involved within the realm of practical accomplishment
when the number of trials, n, and the consequent factorials, are
increased greatly. We therefore now proceed to set out the
classical approach to this important question.
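
For a small number of trials the summation between stated limits is easily sketched. The Python fragment below is an illustration only; the name `probability_between` and the chosen limits are assumptions, and it simply adds the relevant terms of the point binomial, which is the very labour that the classical approximation was devised to avoid in hand computation.

```python
from math import comb


def probability_between(n, p, low, high):
    """Sum the point-binomial terms for low <= number of happenings <= high."""
    q = 1.0 - p
    first, last = max(low, 0), min(high, n)
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(first, last + 1))


if __name__ == "__main__":
    # Ten trials with p = 1/2: a deviation of not more than 1 from the mean of 5.
    print(probability_between(10, 0.5, 4, 6))
```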

The problem is to find a method of dealing with the fundamental
probability, already given in (2), of np+x successes (rather than
the expected number np) in n independent trials with constant
probability p, i.e., the probability

$$y_x = \frac{n!}{(np+x)!\,(nq-x)!}\; p^{np+x}\, q^{nq-x} \tag{2}$$

when n is large.


This can be done by Stirling’s formula (p. 161; A; 1) that*

$$n! \approx n^{n} e^{-n} \sqrt{2\pi n}\left(1 + \frac{1}{12n} + \dots\right) \tag{9}$$

*$\approx$ denotes “is approximately equal to”.

Thus, replacing the factorials in (2) by this expression with the
term involving $\frac{1}{12n}$ neglected (see p. 151; A; 1), taking
logarithms, and reducing, we obtain easily (as on p. 203; B; 5) the
approximation

$$y_x \approx \frac{1}{\sqrt{2\pi npq}}\; e^{-\frac{x^{2}}{2npq}} \tag{10}$$

which may be identified as the Normal Law of Deviations
(P:15^:173). Its discovery is to be attributed to De Moivre
(p. 151; A; 2).
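
The closeness of (10) to the exact term (2) can be examined numerically. The sketch below, in Python, is illustrative only: the function names `exact_term` and `normal_term` and the values n = 400, p = 0.2 are assumptions, and n is kept moderate so that the binomial coefficient can still be converted to a floating-point number.

```python
from math import comb, exp, pi, sqrt


def exact_term(n, p, k):
    """Formula (2): exact probability of k = np + x happenings."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def normal_term(n, p, k):
    """Formula (10): the approximation evaluated at the deviation x = k - np."""
    npq = n * p * (1.0 - p)
    x = k - n * p
    return exp(-x * x / (2.0 * npq)) / sqrt(2.0 * pi * npq)


if __name__ == "__main__":
    n, p = 400, 0.2  # illustrative values; the mean np is 80
    for k in (60, 70, 80, 90, 100):
        print(f"k = {k:3d}   exact = {exact_term(n, p, k):.6f}   "
              f"normal = {normal_term(n, p, k):.6f}")
```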

It will be remembered that np+x is here restricted to integral
values, so that the formula, being in fact at this stage an
approximation to the point binomial, represents still a
bell-shaped series of ordinates (of the type of Fig. 2). It is,
however, also to be observed that, whereas (2) started out as a
symmetrical or unsymmetrical series, according as $p = q$ or
$p \neq q$, (10), through the neglect of certain terms in the
development of the approximation (as shown on p. 204; B; 5), has
emerged as a symmetrical expression in every case (whether
$p = q$ or $p \neq q$), since $y_x = y_{-x}$. For
the purposes of obtaining, through this approximation, a method
by which calculations can be performed when n is large, this
transformation of the sometimes unsymmetrical expression (2)
into the necessarily symmetrical formula (10) is not as serious a
change as might at first appear. The facilities which it provides
are very great, and its results are remarkably accurate, except in
the comparatively unusual cases when q (or p) is so small and n
sufficiently large that nq (or np) remains finite but small (less
than about 10, perhaps), i.e., q (or p) is small but n is sufficiently
large that the event happens only very occasionally (see p. 267;
C; 4, and p. 310; C; 14).
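
The exceptional case just described, in which np remains small, may be seen in a short numerical comparison. This Python sketch is an illustration under assumed values (n = 1000, p = 0.002, so that np = 2); the function names are hypothetical. It shows the exact terms to be markedly unsymmetrical about the mean while formula (10) gives equal values at equal distances on either side.

```python
from math import comb, exp, pi, sqrt


def exact_term(n, p, k):
    """Exact point-binomial probability of k happenings, as in (2)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def normal_term(n, p, k):
    """Symmetrical approximation (10) at the deviation x = k - np."""
    npq = n * p * (1.0 - p)
    x = k - n * p
    return exp(-x * x / (2.0 * npq)) / sqrt(2.0 * pi * npq)


if __name__ == "__main__":
    n, p = 1000, 0.002  # a rare event: np = 2
    for k in range(0, 7):
        print(f"k = {k}   x = {k - n * p:+.0f}   "
              f"exact = {exact_term(n, p, k):.4f}   "
              f"normal = {normal_term(n, p, k):.4f}")
```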

If we now write $c = \sqrt{2npq}$ the expression remains simply an
approximate formula
$y_x \approx \frac{1}{c\sqrt{\pi}}\, e^{-\frac{x^{2}}{c^{2}}}$. But if
its discontinuous series of ordinates, depending upon n, p, and q,
be now smoothed out into a continuous bell-shaped curve (Fig. 8 on
p. 69), depending either approximately upon specific values of
n, p, and q, or upon other general and non-specific considerations,
it may obviously be imagined in the form

$$f(x) = \frac{1}{c\sqrt{\pi}}\; e^{-\frac{x^{2}}{c^{2}}} \tag{11}$$

In this guise it has become famous as the Normal Curve of
Error, upon which the classical theory was founded (see p. 151;
A; 3), and to which modern theories, despite many efforts to
break away from it, still not infrequently return.

It will be important for the student to remember that in 
determining formulae (4) and (5) the origin was taken at the 
beginning of the range, but that the deviation x in (2), (10), and 
(11), and formulae (6), (7), and (8), are based on the origin taken 
at the mean. 

From the preceding method of approach it will be seen at 
once that there is a close analogy, but also a distinction, between 
the deviation x in (10) and the error x as it obviously can be
visualized in the more general Normal Curve of Error (11). The
deviation x in (10) proceeds by integers (since np+x is integral,
although neither np nor x is necessarily integral), and depends
on known a priori values of n, p, and q; the “error” x in (11)
clearly includes also the concept of a difference, either integral or
fractional, which may be supposed to have occurred from any
cause or causes, whether expressible a priori in terms of n, p,
and q, or to be related to the true value a posteriori by analysis
of a set of observations (see p. 187; B; 2, and p. 263; C; 1). The 
analogy is important on account of the legitimate support which 
it provides for the almost simultaneous deduction of both (10) 
and (11), as is shown here; the distinction, however, is likewise 
to be emphasized, because it involves much of the philosophical 
debate which stimulated the attempts to “prove” the Normal 
Law (see p. 151; A; 3, and p. 158; A; 4). 

A number of important functions can now be deduced easily 
from (10) and (11). Remembering that the deviation x in (10) 
is measured from the mean, np (as shown in (4)), and since (10)
is, in fact, an approximate expression for the point binomial when
n is large, we shall begin by using the calculus to establish the
values of the mean, and of the mean square deviation, which
have already been found for the point binomial by the use merely
of simple algebra. It will be convenient and sufficient to deal
here with the continuous form (11), and to direct attention to
p. 206; B; 6 with regard to the restrictions necessitated in the
case of (10) by the requirement of finite integration.

The probability of a deviation, or error, x being then, by (10)
and (11), expressible as
$\frac{1}{c\sqrt{\pi}}\, e^{-\frac{x^{2}}{c^{2}}}$, where
$c = \sqrt{2npq}$ in (10), we may take that of an error between x
and $x + \delta x$ (where $\delta x$ is infinitesimally small) as
$\frac{1}{c\sqrt{\pi}}\, e^{-\frac{x^{2}}{c^{2}}}\, \delta x$ (see
p. 208; B; 7). In n trials, accordingly, the expected number of
deviations or errors lying between x and $x + \delta x$ would be
$\frac{n}{c\sqrt{\pi}}\, e^{-\frac{x^{2}}{c^{2}}}\, \delta x$, with an
expected magnitude of
$\frac{n}{c\sqrt{\pi}}\, x\, e^{-\frac{x^{2}}{c^{2}}}\, \delta x$. When,
therefore, a deviation or error of any magnitude and either sign
is possible, as in (11), so that the range of x is from $-\infty$ to
$+\infty$, it follows that in the n trials the total expected
magnitude of the deviations or errors would be
$\frac{n}{c\sqrt{\pi}} \int_{-\infty}^{+\infty} x\, e^{-\frac{x^{2}}{c^{2}}}\, dx$,
and consequently that the average magnitude, with reference to a
single trial, is

$$\frac{1}{c\sqrt{\pi}} \int_{-\infty}^{+\infty} x\, e^{-\frac{x^{2}}{c^{2}}}\, dx = \left[-\frac{c}{2\sqrt{\pi}}\, e^{-\frac{x^{2}}{c^{2}}}\right]_{-\infty}^{+\infty} = 0 \tag{12}$$

That is to say, the mean here = 0, i.e., the origin of (11) is at the
mean, which obviously corresponds properly to the determination
in (4) of the mean as np for the point binomial (3), for there
the origin was taken at the beginning of the range (see also
p. 208; B; 7).

If, instead of giving effect to the sign of x, as in the derivation
of (12), we take the mean of the absolute values irrespective of
sign, we find the average or mean (expected) error, irrespective
of sign, or $\eta$,

$$\eta = \frac{2}{c\sqrt{\pi}} \int_{0}^{\infty} x\, e^{-\frac{x^{2}}{c^{2}}}\, dx = \frac{c}{\sqrt{\pi}} \tag{13}$$

by (c) and (d), p. 209; B; 7.

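Similarly, the average of the expected squares of the errors,
i.e., the mean square error or second moment, or variance (see
p. 163; A; 6), is

$$\frac{1}{c\sqrt{\pi}} \int_{-\infty}^{+\infty} x^{2}\, e^{-\frac{x^{2}}{c^{2}}}\, dx = \frac{2}{c\sqrt{\pi}} \int_{0}^{\infty} x^{2}\, e^{-\frac{x^{2}}{c^{2}}}\, dx = \frac{c^{2}}{2} \tag{14}$$

by (e), p. 210; B; 7.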

The standard deviation, $\sigma$, being defined as the square root
of the mean square error as in (8), is therefore

$$\sigma = \frac{c}{\sqrt{2}} \tag{15}$$
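
The integrals leading to (12), (13), (14), and (15) may be verified by numerical quadrature. The following Python sketch is an illustration only and is not the analytical proof cited in the text; it assumes a modulus c = 2, truncates the infinite range at ten times c on either side, and uses a composite Simpson rule, all of which are choices made here for convenience.

```python
from math import exp, pi, sqrt


def normal_curve(x, c):
    """Formula (11): f(x) = exp(-x^2 / c^2) / (c * sqrt(pi))."""
    return exp(-((x / c) ** 2)) / (c * sqrt(pi))


def simpson(g, a, b, steps=20_000):
    """Composite Simpson rule over [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = g(a) + g(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0


def moments(c, width=10.0):
    """Return (mean, mean error, mean square error) of the curve with modulus c."""
    a, b = -width * c, width * c  # the tails beyond this range are negligible
    mean = simpson(lambda x: x * normal_curve(x, c), a, b)
    mean_error = simpson(lambda x: abs(x) * normal_curve(x, c), a, b)
    mean_square = simpson(lambda x: x * x * normal_curve(x, c), a, b)
    return mean, mean_error, mean_square


if __name__ == "__main__":
    c = 2.0  # assumed modulus, for illustration
    mean, mean_error, mean_square = moments(c)
    print("mean              :", mean, "  (12) gives 0")
    print("mean error        :", mean_error, "  (13) gives", c / sqrt(pi))
    print("mean square error :", mean_square, "  (14) gives", c * c / 2.0)
    print("standard deviation:", sqrt(mean_square), "  (15) gives", c / sqrt(2.0))
```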


The probable error, $\lambda$, or “quartile deviation” (see p. 192;
B; 3) is the error such that there is an even chance that any error
will fall short of or exceed it in absolute magnitude. That is,
$\lambda$ is to be determined from

$$\frac{1}{c\sqrt{\pi}} \int_{-\lambda}^{+\lambda} e^{-\frac{x^{2}}{c^{2}}}\, dx = \frac{1}{2}, \quad \text{or} \quad \frac{2}{\sqrt{\pi}} \int_{0}^{\lambda/c} e^{-t^{2}}\, dt = \frac{1}{2} \tag{16}$$

from which, by tables of the probability integral (p. 161; A; 5),

$$\lambda = .476936\,c \tag{17}$$*
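
In place of tables of the probability integral, the ratio of the probable error to the modulus can be found by solving the second form of (16) numerically. The Python sketch below is an illustration only; it relies on the standard library function `math.erf`, which evaluates the same integral, and the function name `probable_error_ratio` and the use of bisection are assumptions made here.

```python
from math import erf


def probable_error_ratio(tolerance=1e-12):
    """Solve erf(z) = 1/2 for z = lambda / c by bisection on the interval [0, 1]."""
    low, high = 0.0, 1.0  # erf(0) = 0 and erf(1) exceeds 1/2, so the root lies between
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if erf(middle) < 0.5:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0


if __name__ == "__main__":
    print("lambda / c =", probable_error_ratio())
```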

*The values are stated here to 6 places for purposes of record. In practice,
however, 3 places are quite sufficient.

From the above results it follows that

$$c = \sqrt{2npq} = \sqrt{\pi}\,(\text{mean error}) = \sqrt{2}\,(\text{standard deviation}) = \frac{1}{.476936}\,(\text{probable error}) \tag{18}$$

or

$$c = \sqrt{2npq} = 1.772454\,\eta = 1.414214\,\sigma = 2.096665\,\lambda \tag{19}$$

and hence, for the conditions of “simple sampling” (see Chapter
V) underlying (10),

$$\text{Mean Error} = .797885\sqrt{npq} = .797885\,\sigma = \tfrac{4}{5}\,\sigma \text{ approximately} \tag{20}$$

and

$$\text{Probable Error} = .674489\sqrt{npq} = .674489\,\sigma = \tfrac{2}{3}\,\sigma \text{ approximately} \tag{21}$$
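
The relations among the four measures can be gathered into a single calculation from n and p. This Python sketch is illustrative only; the function name `dispersion_measures`, the dictionary keys, and the values n = 100, p = 1/2 are assumptions, and the constant .476936 is taken from (17).

```python
from math import pi, sqrt

LAMBDA_OVER_C = 0.476936  # ratio of probable error to modulus, from (17)


def dispersion_measures(n, p):
    """Modulus, standard deviation, mean error and probable error under simple sampling."""
    q = 1.0 - p
    sigma = sqrt(n * p * q)            # formula (8)
    c = sqrt(2.0) * sigma              # modulus, c = sqrt(2npq)
    return {
        "modulus": c,
        "standard_deviation": sigma,
        "mean_error": c / sqrt(pi),    # formula (13)
        "probable_error": LAMBDA_OVER_C * c,
    }


if __name__ == "__main__":
    for name, value in dispersion_measures(100, 0.5).items():
        print(f"{name:<20s}{value:.4f}")
```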

The order of magnitude of these quantities is the probable
error ($\lambda$), mean error ($\eta$), standard deviation ($\sigma$),
and modulus (c; see p. 163; A; 6), as shown in Figure 3 (in which
c is taken as 1, and therefore $\lambda = .48$, $\eta = .56$,
$\sigma = .71$, $2\sigma = 1.41$, and $3\lambda = 1.43$, to 2 places).



Figure 3. Relative Locations of $\lambda$, $\eta$, $\sigma$, and their Multiples,
in the Normal Curve

It will be seen immediately from this diagrammatic presentation
that $\lambda$, $\eta$, $\sigma$, and c vary in their significance. For example,
remembering that, by definition (formula (16)), the probability
of a deviation lying within the range $\pm\lambda$ is .5, it can easily be
found similarly from tables of the "probability integral" (see
p. 161; A; 5) that the probabilities of a deviation lying within the
ranges $\pm\lambda$, $\pm 2\lambda$, $\pm 3\lambda$, $\pm 4\lambda$, $\pm 5\lambda$, . . . are .500, .823, .957, .993,
.9993, . . . , or within the ranges $\pm\sigma$, $\pm 2\sigma$, $\pm 3\sigma$, $\pm 4\sigma$, . . . are
.6827, .9545, .9973, .99994, . . . , respectively. In other words,
in a "normal" distribution the proportions of the total area
included between the curve, the x-axis, and the ordinates through
$\pm\lambda$, $\pm 2\lambda$, $\pm 3\lambda$, or $\pm 4\lambda$, are 50%, 82%, 96%, or 99%, while the
corresponding percentages for the ordinates through $\pm\sigma$, $\pm 2\sigma$,
or $\pm 3\sigma$ are 68%, 95%, or over 99%.

It follows also, therefore, that the proportions outside $\pm\lambda$,
$\pm 2\lambda$, $\pm 3\lambda$, or $\pm 4\lambda$ are 50%, 18%, 4%, or 1%, and that those
outside $\pm\sigma$, $\pm 2\sigma$, or $\pm 3\sigma$ are 32%, 5%, or under 1%.

These results, furthermore, may obviously (by reason of the
symmetry of the Normal Curve) be stated comparably in terms
of either the plus or minus halves of the discrepancies, taken
separately. Thus the probabilities of a positive deviation lying
beyond $+\lambda$, $+2\lambda$, $+3\lambda$, $+4\lambda$, $+5\lambda$, . . . are .250, .089, .022, .0035,
.00035, . . . , with the same probabilities for negative deviations
beyond $-\lambda$, $-2\lambda$, $-3\lambda$, $-4\lambda$, $-5\lambda$; and the probabilities of a
positive deviation beyond $+\sigma$, $+2\sigma$, $+3\sigma$, $+4\sigma$, . . . , are .1587,
.0228, .00135, .00003, and the same for a negative deviation
beyond $-\sigma$, $-2\sigma$, $-3\sigma$, $-4\sigma$, . . . (see p. 269; C; 6).

It will be useful to remember from these results that, for a
"normal" distribution (i) twice or three times $\pm\sigma$ includes an
area about the same as three or four times $\pm\lambda$; (ii) a deviation
of more than $\pm 2\sigma$ or $\pm 3\lambda$ is very unlikely (the percentage of
occurrence being less than 5%); and (iii) $\pm 3\sigma$ and $\pm 4\lambda$ embrace
over 99% of the deviations.
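
The tabulated probabilities above can be reproduced without tables of the probability integral. This is a small sketch under the assumption that the standard library's `NormalDist` stands in for those tables; the names `prob_within`, `within_lambda`, and `within_sigma` are illustrative only. Deviations are measured in units of the standard deviation, so the probable error is the 75th percentile of the standard normal law.

```python
from statistics import NormalDist

Z = NormalDist()                  # normal law with mean 0 and sigma = 1
LAMBDA = Z.inv_cdf(0.75)          # probable error in units of sigma

def prob_within(multiple_of_sigma):
    """Probability that a deviation lies within +/- the given multiple of sigma."""
    return 2 * Z.cdf(multiple_of_sigma) - 1

within_lambda = [prob_within(k * LAMBDA) for k in range(1, 6)]   # 1 to 5 lambda
within_sigma = [prob_within(k) for k in range(1, 5)]             # 1 to 4 sigma

for k, p in enumerate(within_lambda, start=1):
    print(f"within +/-{k} lambda: {p:.4f}   beyond on one side: {(1 - p) / 2:.5f}")
for k, p in enumerate(within_sigma, start=1):
    print(f"within +/-{k} sigma : {p:.5f}   beyond on one side: {(1 - p) / 2:.5f}")
```

The one-sided tail probabilities are simply half of the complement, by the symmetry of the curve.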

The preceding account of the Classical Approach will, to this
stage, have familiarized the student with the following basic
ideas (in which are now again included, for completeness, those
already noted up to p. 13):

(a) The distinction and relation between a "true" probability,
a priori, and the observed "statistical frequency", a posteriori;

(b) The discontinuous, and symmetrical or unsymmetrical,
"point binomial" as the most elementary type of "frequency
distribution" for the number of successes in n trials;

(c) The "mean" np, the "mean square deviation" npq, and
the "standard deviation" $\sqrt{npq}$, for that "point binomial";

(d) The symmetrical "Normal Law of Deviations" (10) as a
close approximation to the "point binomial" (except in special
circumstances), and the symmetrical "Normal Curve of Error"
(11) as a general representation of "deviations" or "errors"
from the mean;

(e) For these expressions, the mean at the origin = 0;
$c = \sqrt{2npq}$; the mean error $= c/\sqrt{\pi}$; the mean square deviation
$= npq$; the standard deviation $= \sqrt{npq}$; and the probable error
$= .476936\,c$; and

(f) The improbability, for a "normal" distribution, of a
deviation of more than $\pm 3\sigma$ or $\pm 4\lambda$.



IV. THE COMBINATION OF OBSERVATIONS 


We now come naturally to a further series of basic formulae,
which arise directly from the classical approach under the
conditions of independent trials and constant probability. It may be
asked at once, for example, what form the expressions take when
the problems are viewed from the standpoint of relative frequencies
rather than actual occurrences, or when a number of independent
trials are combined. The formulae will therefore now be
developed, with their limitations and with illustrations, for the mean
square error of (a) a multiple of the number of occurrences; (b) a
linear compound of n independent quantities; (c) any function
of n independent quantities; and, as special cases of (c), (d) a
product of two independent quantities; (e) the arithmetic mean;
(f) the difference between two ratios; and (g) a logarithm. These
formulae are of great importance.

The expressions will be given throughout for the mean square
error, by reason firstly of its descriptive character, which recalls
to mind easily the nature of the problem, and secondly because
it is equivalent to $\sigma^2$, a parameter much used in the modern
developments of Mathematical Statistics. Since, however, by
(18), the mean error ($\eta$), standard deviation ($\sigma$), and probable
error ($\lambda$) are all proportional to each other, it follows that,
although the formulae will be given for $\sigma^2$, all the expressions will
be of exactly the same form for $\eta^2$ or $\lambda^2$.

Applications of the various formulae to the special problems 
encountered in actuarial work are discussed at p. 272; C; 7. 

(a) The Mean Square Error of a Multiple 

In considering relative frequencies instead of actual occurrences,
the problem is simply that of determining the probabilities
of deviations in relation to a stated number. The argument
in reaching the preceding formulae (as pointed out in C; 2 and
illustrated in C; 4, C; 5, and C; 6) concerned the deviations
between an actual number of occurrences, s, and the mean expected
number np, i.e., the deviation $s - np$, or $n\left(\frac{s}{n} - p\right)$, in which $\frac{s}{n}$
is the observed statistical frequency and p is the true a priori
probability. If, however, the occurrences in relation to the n
trials be considered, we should evidently be dealing with the
deviation $\left(\frac{s}{n} - p\right)$ between the observed statistical frequency
and the true a priori probability, p. This simply means a
division throughout by the invariable factor n. We may therefore
write down from (13) that the mean error in the relative number
of successes, $\frac{s}{n}$, is $\frac{1}{n}\left(\frac{c}{\sqrt{\pi}}\right)$; from (15) that the standard
deviation is $\frac{1}{n}\left(\frac{c}{\sqrt{2}}\right)$; and from (17) that the probable error is
$\frac{1}{n}(.476936\,c)$. That is to say, since $c = \sqrt{2npq}$, we have for $\frac{s}{n}$,
the observed statistical frequency, that

$$\sqrt{\frac{2pq}{n}} = \sqrt{\pi}\,\eta = \sqrt{2}\,\sigma = \frac{\lambda}{.476936} \tag{22}$$

which corresponds with (18).
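
As an illustration of the division by n, the three error measures of an observed frequency can be computed directly from n and p. This is a sketch only; the function name `frequency_error_measures` and the figures n = 5000, p = .02 are hypothetical and not taken from the text.

```python
import math

def frequency_error_measures(n, p):
    """Mean error, standard deviation, and probable error of s/n,
    for n independent trials with constant probability p."""
    q = 1 - p
    c = math.sqrt(2 * n * p * q)              # modulus for the number of successes
    eta = c / math.sqrt(math.pi) / n          # from (13), divided by n
    sigma = c / math.sqrt(2) / n              # from (15), divided by n
    lam = 0.476936 * c / n                    # from (17), divided by n
    return eta, sigma, lam

n, p = 5000, 0.02                             # hypothetical figures
eta, sigma, lam = frequency_error_measures(n, p)
print(eta, sigma, lam)
print(math.sqrt(2 * p * (1 - p) / n), math.sqrt(2) * sigma)   # both sides of (22)
```

The last line prints the same value twice, showing that the standard deviation of the frequency is $\sqrt{pq/n}$.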

From the above principle it will also be apparent that the
mean error, standard deviation, and probable error of an
algebraical or numerical multiple of s, say ks, will be found from (18)
by multiplying the appropriate relation therein by k.
Correspondingly, the

Mean Square Error of ks is $k^2\sigma^2$, where $\sigma^2$ is the mean square
error of s $\qquad$ ....(23)


(b) The Mean Square Error of a Linear Compound 

This case concerns a very important matter, of which much
use is made in the theory of graduation of mortality tables. The
question may be stated, in its most elementary form, as the
determination of the mean square error of the sum of two
independent quantities.

Suppose that two independent quantities have been observed,
of which the true values are $F_1$ and $F_2$, and that their errors follow
respectively the Normal Curves $\frac{1}{c\sqrt{\pi}} e^{-x^2/c^2}$ and $\frac{1}{k\sqrt{\pi}} e^{-y^2/k^2}$; we
seek to determine the law followed by the errors in their sum
$F_1 + F_2$.

Clearly, the simultaneous occurrence of an error between
$x$ and $x + \delta x$ in $F_1$, and of an error between $y$ and $y + \delta y$ in $F_2$, will
cause an error in $F_1 + F_2$ which will lie between $x + y$ and
$x + \delta x + y + \delta y$. If z be written for $x + y$, the error in $F_1 + F_2$
may therefore be said to lie between $z$ and $z + \delta z$, if $\delta x$, $\delta y$, and
$\delta z$ are infinitesimal increments. Consequently, if an error x,
lying anywhere between $-\infty$ and $+\infty$, be committed in respect
of $F_1$, then the remainder of the total error must be committed
in respect of $F_2$, and must lie anywhere between $z - x$ and
$z + \delta z - x$. The compound probability of these two errors
occurring together ($F_1$ and $F_2$ being wholly independent) will obviously
be

$$\frac{1}{c\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-x^2/c^2}\, dx \; \frac{1}{k\sqrt{\pi}} \int_{z-x}^{z+\delta z-x} e^{-y^2/k^2}\, dy \tag{24}$$

which, as shown at p. 211; B; 8, reduces to

$$\frac{1}{\sqrt{\pi}\sqrt{c^2+k^2}}\, e^{-z^2/(c^2+k^2)}\, \delta z.$$

Since, therefore, this expression represents the probability of
an error between $z$ and $z + \delta z$, it follows that the probability of
an error z is

$$\frac{1}{\sqrt{\pi}\sqrt{c^2+k^2}}\, e^{-z^2/(c^2+k^2)}, \quad \text{or} \quad \frac{1}{\gamma\sqrt{\pi}}\, e^{-z^2/\gamma^2}, \quad \text{where } \gamma^2 = c^2 + k^2 \tag{25}$$


This remarkably elegant formula symbolizes the very important
result that when the errors in $F_1$ and $F_2$ are independent and
follow the Normal Curve with parameters c and k respectively,
then the errors in the sum $F_1 + F_2$ also follow the Normal Curve
with parameter $\gamma = \sqrt{c^2 + k^2}$.

Since here c, k, and $\gamma$ are the parameters of three distinct
Normal Curves of error, it will be seen that, by (14), their
respective mean square errors are $\frac{c^2}{2}$, $\frac{k^2}{2}$, and $\frac{\gamma^2}{2} = \frac{c^2 + k^2}{2}$. If,
therefore, we write $\sigma_1^2$ and $\sigma_2^2$ for the mean square errors of the
independent observed quantities $F_1$ and $F_2$ respectively, the mean
square error of their sum $F_1 + F_2$ is

$$\sigma_1^2 + \sigma_2^2 \tag{26}$$
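
The reduction of (24) to (25) can be checked numerically. The following sketch integrates the product of the two error curves over x by the trapezoidal rule and compares the result with the Normal Curve whose parameter is the square root of the sum of the squares of c and k. The function names, the moduli c = 1 and k = 2, and the integration limits are assumptions made for this illustration.

```python
import math

def normal_curve(x, c):
    """Ordinate of the Normal Curve of Error with modulus c."""
    return math.exp(-(x / c) ** 2) / (c * math.sqrt(math.pi))

def density_of_sum(z, c, k, half_width=12.0, steps=4800):
    """Trapezoidal approximation to the integral over x of
    normal_curve(x, c) * normal_curve(z - x, k), i.e. the probability
    density of a total error z made up of independent errors x and z - x."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_curve(x, c) * normal_curve(z - x, k)
    return total * h

c, k = 1.0, 2.0                       # hypothetical moduli
gamma = math.sqrt(c * c + k * k)      # parameter of the curve for the sum
for z in (0.0, 0.5, 1.5, 3.0):
    print(z, density_of_sum(z, c, k), normal_curve(z, gamma))
```

The two printed columns agree closely at each z, and the mean square error of the sum, half the square of gamma, equals the sum of the two separate mean square errors.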

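The method of proof given for (24), (25), and (26) may
evidently be extended for any number of independent quantities
$F_1, F_2, \ldots, F_n$, so that the errors in the sum $F_1 + F_2 + \ldots + F_n$ will
obey the Normal Curve with parameter $\gamma = \sqrt{c_1^2 + c_2^2 + \ldots + c_n^2}$,
where $c_1, c_2, \ldots, c_n$ are the parameters of the separate curves;
correspondingly, the mean square error will be $\sigma_1^2 + \sigma_2^2 + \ldots + \sigma_n^2$.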

Moreover, it has already been shown in (23) that the mean
square error in kF is $k^2$ times the mean square error in F. If,
therefore, we have any linear compound, $l_1F_1 + l_2F_2 + \ldots + l_nF_n$, of
any number of independent observed quantities $F_1, F_2, \ldots, F_n$
each obeying the Normal Curve with mean square errors
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ respectively, where $l_1, l_2, \ldots, l_n$ are multipliers
(positive or negative, integral or fractional), then the

Mean Square Error of the Linear Compound is

$$l_1^2\sigma_1^2 + l_2^2\sigma_2^2 + \ldots + l_n^2\sigma_n^2 \tag{27}$$
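
Formula (27) translates directly into a short routine. The sketch below is illustrative only: the function name `mse_linear_compound` and the sample multipliers are assumptions, not taken from the text.

```python
def mse_linear_compound(multipliers, mean_square_errors):
    """Mean square error of l1*F1 + ... + ln*Fn for independent F's,
    following formula (27): the sum of l_i squared times sigma_i squared."""
    if len(multipliers) != len(mean_square_errors):
        raise ValueError("one mean square error is needed per multiplier")
    return sum(l * l * v for l, v in zip(multipliers, mean_square_errors))

# A three-term weighted compound of quantities each with mean square error 4.
smoothed = mse_linear_compound([0.25, 0.5, 0.25], [4, 4, 4])

# The arithmetic mean of four determinations is the compound with l = 1/4.
mean_of_four = mse_linear_compound([0.25] * 4, [4] * 4)

# A difference F1 - F2 has multipliers +1 and -1; the errors still add.
difference = mse_linear_compound([1, -1], [2.0, 3.0])

print(smoothed, mean_of_four, difference)
```

Note that a negative multiplier does not reduce the result, since each multiplier enters only through its square.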

(c) The Mean Square Error of a Function 

It will now be convenient, in order to examine some other
special cases of importance in actuarial work, to consider the
general formula for the mean square error in any function
$F = f(F_1, \ldots, F_n)$ of n independent quantities $F_1, \ldots, F_n$, where,
as before, $F_1, \ldots, F_n$ are the true values, and the respective mean
square errors are $\sigma_1^2, \ldots, \sigma_n^2$.

Now if errors $x_1, \ldots, x_n$ are committed in $F_1, \ldots, F_n$
respectively, the error in F will be

$$f[(F_1 + x_1), \ldots, (F_n + x_n)] - f(F_1, \ldots, F_n).$$

So long as these errors $x_1, \ldots, x_n$ are so small that their squares,
products, and higher powers may be neglected (an important
limitation), this may be expanded by Taylor's Theorem to give
approximately

$$x_1 \frac{\partial F}{\partial F_1} + x_2 \frac{\partial F}{\partial F_2} + \ldots + x_n \frac{\partial F}{\partial F_n}.$$

But this is a linear compound, to which formula (27) is directly
applicable; it therefore follows that the

Mean Square Error of a Function $F = f(F_1, \ldots, F_n)$ is
approximately

$$\left(\frac{\partial F}{\partial F_1}\right)^2 \sigma_1^2 + \left(\frac{\partial F}{\partial F_2}\right)^2 \sigma_2^2 + \ldots + \left(\frac{\partial F}{\partial F_n}\right)^2 \sigma_n^2 \tag{28}$$

It is to be noted that $\frac{\partial F}{\partial F_t}$ is based on the "true" values.
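
The general formula for a function can be sketched in code by replacing each partial derivative with a central difference taken at the true values. This is an assumption of the illustration, not the method of the text; the function name `mse_of_function`, the step size, and the numerical figures are all hypothetical. The two checks use a product and a natural logarithm, whose derivatives are easily evaluated by hand.

```python
import math

def mse_of_function(f, true_values, mean_square_errors, step=1e-6):
    """Approximate mean square error of f(F1, ..., Fn) for independent F's,
    using the first-order (small-error) formula with numerical derivatives."""
    total = 0.0
    for i, (value, mse) in enumerate(zip(true_values, mean_square_errors)):
        h = step * max(1.0, abs(value))
        up = list(true_values)
        down = list(true_values)
        up[i] = value + h
        down[i] = value - h
        slope = (f(*up) - f(*down)) / (2 * h)   # central-difference derivative
        total += slope ** 2 * mse
    return total

# Product F1 * F2 at F1 = 3, F2 = 5 with mean square errors .04 and .09:
# by hand, 5^2 * .04 + 3^2 * .09 = 1.81.
product_mse = mse_of_function(lambda a, b: a * b, [3.0, 5.0], [0.04, 0.09])

# Natural logarithm at F1 = 4 with mean square error .16:
# by hand, .16 / 4^2 = .01.
log_mse = mse_of_function(math.log, [4.0], [0.16])

print(product_mse, log_mse)
```

The approximation is only as good as the neglect of squares and products of the errors; for large errors relative to the true values it should not be relied upon.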

As special cases of the preceding general formulae which are
often useful, we shall now examine the mean square error of
(d) a product, (e) the arithmetic mean, (f) the difference between
two ratios, and (g) a logarithm.


(d) The Mean Square Error of a Product

The mean square error of a product, $F_1F_2$, of two independent
quantities may be written down at once from (28) as
approximately

$$F_2^2\sigma_1^2 + F_1^2\sigma_2^2 \tag{29}$$



(e) The Mean Square Error of the Arithmetic Mean

If r precisely similar independent determinations are made
of a single quantity $F_1$, and their simple arithmetic mean is then
taken by the usual process of summing and dividing by r, the
effect is the same as if one determination had been made of each
of r independent quantities $F_1, \ldots, F_r$, all with the same mean
square error, which were then combined by each one being
multiplied by $\frac{1}{r}$. The mean square error of the arithmetic
mean of r independent determinations of a single quantity is
therefore obtainable directly from formula (27) for a linear
compound, by putting $l_1 = \ldots = l_r = \frac{1}{r}$, and $\sigma_1^2 = \ldots = \sigma_r^2 = \sigma^2$, so
that the mean square error of the arithmetic mean (cf. p. 272; C; 7) is

$$r \cdot \frac{1}{r^2}\, \sigma^2 = \frac{\sigma^2}{r} \tag{30}$$

This is often expressed verbally by the statement that the
standard deviation of the arithmetic mean of r observations of a
single quantity is $\frac{1}{\sqrt{r}}$ times the standard deviation of a single
observation (see p. 287; C; 8).

(f) The Mean Square Error of the Difference between Two
Ratios

The formula for this special case will often be met by the
student in terms of n, p, and q, in which form it is useful for
comparing rates of mortality, disability, withdrawal,
retirement, etc.

Suppose that in $n_1$ trials, for each of which the true
probabilities are p and q, a number of successes $s_1$ has been observed;
and that in another independent set of $n_2$ trials, under the same
conditions with the same true probabilities p and q, another
number $s_2$ of successes has occurred. Then, by (22), $\sigma^2$ for the
observed statistical frequency is $\frac{pq}{n_1}$ in the first set of trials, and
$\frac{pq}{n_2}$ in the second set. If we now wish to determine $\sigma^2$ for the
difference between the two observed values, formula (27) may be
applied directly by taking $l_1 = 1$ and $l_2 = -1$, so that we obtain

$$(1)^2\left(\frac{pq}{n_1}\right) + (-1)^2\left(\frac{pq}{n_2}\right) = pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{31}$$

Illustrations of its use are shown at p. 289, in (3) of C; 9.

In this formula the true p (and q) are assumed to be known;
and it is to be noted that the true p and the two observed ratios,
$\frac{s_1}{n_1}$ and $\frac{s_2}{n_2}$, may all be different. When, however, p (and q)
are not known it is necessary to form some estimate of their
values. This may sometimes be done, as a practical matter,
by the application of reasonable judgment to properly
comparable data, in which case p (and q) are merely assumed on the
basis of experience instead of being really known. In some cases,
however, even this may be impossible; it is then essential, if the
formula is to be used for practical deductions, to make some
estimate merely from the data alone. Clearly, if the two
samples are "random" (see Chapter V), which is really the object
of investigation, their separate ratios, $\frac{s_1}{n_1}$ and $\frac{s_2}{n_2}$, could be
amalgamated to give $\frac{s_1 + s_2}{n_1 + n_2}$ as an estimate of p, and (31) would
become (cf. F:16:272 and P:5;^:192)

$$\left(\frac{s_1 + s_2}{n_1 + n_2}\right)\left(1 - \frac{s_1 + s_2}{n_1 + n_2}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{32}$$

Some further practical modifications are discussed at p. 291;
C; 10.
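
A comparison of two observed rates along these lines might be sketched as follows. The function names and the figures (30 and 50 occurrences in two groups of 1000) are hypothetical, chosen only to show the arithmetic; the pooled estimate of p is the amalgamated ratio described above.

```python
import math

def mse_difference(n1, n2, p):
    """Mean square error of s1/n1 - s2/n2 when the true p is known (formula 31)."""
    return p * (1 - p) * (1 / n1 + 1 / n2)

def mse_difference_pooled(s1, n1, s2, n2):
    """Same quantity with p estimated from the two samples combined (formula 32)."""
    p_hat = (s1 + s2) / (n1 + n2)
    return mse_difference(n1, n2, p_hat)

s1, n1, s2, n2 = 30, 1000, 50, 1000          # hypothetical observations
variance = mse_difference_pooled(s1, n1, s2, n2)
std_dev = math.sqrt(variance)
difference = s1 / n1 - s2 / n2
ratio = difference / std_dev                  # difference in units of its s.d.

print(variance, std_dev, difference, ratio)
```

In this illustration the observed difference is a little more than twice its standard deviation, which on the probabilities tabulated for the normal curve would occur by chance in fewer than 5% of cases.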


(g) The Mean Square Error of a Logarithm

The mean square error of a logarithm, $\log_e F_1$, can be written
down immediately from (28), since the derivative of $\log_e F_1$ is
$\frac{1}{F_1}$, as being approximately

$$\frac{\sigma_1^2}{F_1^2} \tag{33}$$



V. THE THEORY OF RANDOM SAMPLING 


The Meaning of Simple Sampling 

The formulae of the preceding chapters have been based on the 
occurrences or failures of an event, in a group of n independent 
trials, for each of which n trials there is supposed to be the same 
true probability of either occurrence or failure, p or q. This 
assumption that the true probabilities are constant for every 
member of the group of n obviously means that the group is 
supposed to be absolutely homogeneous. Now when such a 
group of n persons is limited in size, so that the observed statis- 
tical frequency of deaths (let us say) can be taken only as an ap- 
proximation to the true probability of death, we are in fact dealing 
with a sample of the larger population or universe (sometimes
called the "parent population" or "parent universe") of which
it forms a part. Such a sample, moreover, is said to be random
when it has been so drawn, from its parent universe, that every 
member of the universe has had an equal and independent chance 
of being chosen as a member of the sample. The case of inde- 
pendent trials and constant probability is generally identified in 
the nomenclature as one of simple sampling. 

Modifications of the Conditions of Simple Sampling 

Applications of the formulae, under these "simple sampling"
conditions, to certain types of problems concerning mortality
and allied statistics have been illustrated on pp. 265 to 293, in
C; 4 to C; 10. Those problems have been of such a nature that
it has been possible to apply directly the basic formulae already
deduced. One of the questions considered has been whether the
variations in the observed statistical frequencies, in a series of
v independent investigations based on $n_1, n_2, \ldots, n_v$ trials (i.e.,
cases) respectively, can have arisen from mere chance, or must be
attributed to some specific cause. If the variations cannot have
arisen from mere chance, so that the operation of a specific cause
must be suspected, the conclusion so reached may indicate that
the conditions of simple sampling have, in fact, not been fulfilled.
The primary assumptions of simple sampling are that the true
probabilities p and q remain unchanged throughout. If the
result indicates that those assumptions may not have been
fulfilled, the question arises whether the disturbance may be due
to variation in the underlying true probabilities themselves.

The Theory of Lexis (see p. 163; A; 7) was the first to deal
with this important question in an obviously logical manner, by
modifying the primary assumption of simple sampling that p
and q remain unchanged. Three different types of sampling are
defined as follows in that theory:

(a) Bernoulli sampling (the simple sampling already
considered), for which the true p and q remain unchanged in every
trial of a "set" of $n_1$ trials, and in every trial in the next set of $n_2$
trials, and so on until we reach every trial in the last set of $n_v$
trials; i.e., p and q remain constant in every trial in every set.

(b) Poisson sampling (otherwise called "stratified"
sampling), for which p and q vary in the several trials in the first set
of trials, and vary in the same way in each subsequent set of
trials; i.e., p and q vary from trial to trial but are constant from set
to set.

(c) Lexis sampling, for which p and q are unchanged in every
trial of the first set of trials, and are similarly constant but with
a different value in each subsequent set of trials; i.e., p and q are
constant from trial to trial but vary from set to set.

In order to develop the mathematical theory for these three
types it will be assumed that the sets are all of the same size,
n, i.e., that $n_1 = n_2 = \ldots = n_v = n$. (The treatment of groups of
unequal size in practice is illustrated on p. 299; C; 11.)

(a) For the Bernoulli case, with n trials in each set, $\sigma^2$ for
the number of occurrences in each set, by (7), is npq.
Consequently, if this be computed for each of the v sets, and the
resulting values be summed and divided by v to form the arithmetic
mean, the result (see p. 214; B; 9) is simply npq, which
is here denoted by $\sigma_B^2$.

(b) For the Poisson case it may be shown (see p. 215; B; 9)
that the corresponding value, which may be identified as $\sigma_P^2$, is
less than that of the Bernoulli series, being

$$\sigma_P^2 = \sigma_B^2 - n\sigma_p^2 \tag{34}$$

where $\sigma_p^2$ is the mean square deviation of the probabilities
$p_1, \ldots, p_n$ from their mean $p = \frac{p_1 + \ldots + p_n}{n}$.

(c) For the Lexis case, on the other hand (see p. 216; B; 9),
the corresponding $\sigma_L^2$ is greater than that for Bernoulli sampling,

$$\sigma_L^2 = \sigma_B^2 + n(n-1)\sigma_p^2 \tag{35}$$

where $\sigma_p^2$ is the mean square deviation of the probabilities
$p_1, \ldots, p_v$ of the several sets from their mean $p = \frac{p_1 + \ldots + p_v}{v}$.

On the same principle as that which led to formula (22) for
the standard deviation of a relative frequency we see that in
these cases, likewise, $\sigma^2$ for the frequency (proportion) of successes is
obtained by dividing by $n^2$, giving

For Bernoulli sampling, $\dfrac{pq}{n}$ $\qquad$ ....(36)

For Poisson sampling, $\dfrac{pq}{n} - \dfrac{\sigma_p^2}{n}$ $\qquad$ ....(37)

For Lexis sampling, $\dfrac{pq}{n} + \dfrac{(n-1)}{n}\sigma_p^2$ $\qquad$ ....(38)
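
The relations between the three types can be verified exactly on a small example. The sketch below uses rational arithmetic; the probabilities .1, .2, .3, .6 and the function names are hypothetical. For the Poisson case the variance of the number of successes is taken as the sum of the separate products pq over the trials, and for the Lexis case as the average, over the sets, of the expected squared deviation from n times the mean probability. Both are assumptions about how the quantities are to be computed, stated here because the proofs are given elsewhere in the text.

```python
from fractions import Fraction as Fr

def mean_square(values):
    """Mean square deviation of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def bernoulli_variance(n, p):
    return n * p * (1 - p)

def poisson_variance(trial_probs):
    """p varies from trial to trial within a set: sum of p*q over the trials."""
    return sum(p * (1 - p) for p in trial_probs)

def lexis_variance(n, set_probs):
    """p constant within a set of n trials but varying from set to set:
    average over sets of n*p*q plus the squared shift of the set mean."""
    p_bar = sum(set_probs) / len(set_probs)
    return sum(n * p * (1 - p) + (n * (p - p_bar)) ** 2
               for p in set_probs) / len(set_probs)

probs = [Fr(1, 10), Fr(2, 10), Fr(3, 10), Fr(6, 10)]   # hypothetical chances
n = len(probs)                                          # 4 trials per set
p_bar = sum(probs) / n
spread = mean_square(probs)

sigma_b2 = bernoulli_variance(n, p_bar)
sigma_p2 = poisson_variance(probs)
sigma_l2 = lexis_variance(n, probs)
print(sigma_p2, sigma_b2, sigma_l2)
```

The Poisson value falls short of the Bernoulli value by n times the spread of the probabilities, and the Lexis value exceeds it by n(n-1) times that spread.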


Two measures have been evolved in order to provide a
convenient method of distinguishing between the three types. The
Lexis Ratio, usually denoted by L, is simply

$$L = \frac{\sigma}{\sigma_B} \tag{39}$$

where $\sigma$ is the standard deviation computed directly from the
data, and $\sigma_B$ is that calculated on the assumption that the data
follow the Bernoulli (binomial or normal) type.

This ratio, however, is dependent on the values of the chances
$p_i$ which are involved in $\sigma_B$, and is also affected by variations in
the number n. Charlier therefore suggested using a Coefficient
of Disturbancy (or Variability),

$$C = \frac{100\sqrt{\sigma^2 - \sigma_B^2}}{A} \tag{40}$$

where $\sigma$ and $\sigma_B$ are as just defined, and A is the arithmetic mean
of the data (see p. 216; B; 9).

When L = 1, and C = 0, the data are normal and of the
Bernoulli type.

When L < 1, and C is imaginary, they are said to be subnormal
and of the Poisson type.

When L > 1, and C > 0, they are said to be hypernormal (or
supernormal) and of the Lexis type.
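
A computation of the two measures from a series of observed counts might be sketched as follows. Several choices here are assumptions of the illustration rather than statements of the text: the true p is estimated by the arithmetic mean of the counts divided by n, no small-sample correction is applied, and the counts themselves are hypothetical. The complex square root is used so that an imaginary coefficient is returned rather than an error.

```python
import cmath
import math

def lexis_and_charlier(counts, n):
    """Return (L, C) for v sets of n trials each, given the observed counts."""
    v = len(counts)
    mean = sum(counts) / v                                 # A, the arithmetic mean
    var_data = sum((f - mean) ** 2 for f in counts) / v    # sigma^2 from the data
    p = mean / n                                           # estimated probability
    var_bernoulli = n * p * (1 - p)                        # sigma_B^2 = npq
    lexis_ratio = math.sqrt(var_data / var_bernoulli)
    charlier = 100 * cmath.sqrt(var_data - var_bernoulli) / mean
    return lexis_ratio, charlier

# Two hypothetical series, each of five sets of 100 trials with mean count 10.
L1, C1 = lexis_and_charlier([8, 12, 10, 14, 6], 100)    # little dispersion
L2, C2 = lexis_and_charlier([4, 16, 10, 18, 2], 100)    # much dispersion
print(L1, C1)
print(L2, C2)
```

The first series gives a ratio below 1 with an imaginary coefficient, and the second a ratio above 1 with a real positive coefficient. With only five sets, such values are indications and no more.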

In cases where the samples are small and the observed data 
are used to give estimates of the true values, a correction should 
be introduced on the principles of Bessel's formula (42), as 
explained in the next section of this chapter, and as noted in C; 11. 

It will be seen at once that, apart from the formal
mathematical process by which the method is expressed in formulae
(34) to (38), the practical application of the principle is very
easy. Simple Bernoulli sampling, in effect, is used as a standard
of comparison. A computation is made of $\sigma^2$ from the actual
data; if it is less than the Bernoullian $\sigma_B^2$ of simple sampling, then
the conditions for Poisson sampling are indicated, i.e., that the
probabilities p and q may have varied more among the
individuals or groups within the sample than between samples; if it
is greater than the $\sigma_B^2$ of simple sampling, then the conditions
of the Lexis distribution are suggested, namely, that the
probabilities may have varied more from sample to sample than
within samples.

In practice, of course, the conditions are hardly definable with
the precision of the above theoretical formulation, so that values
which may be actually calculated for L or C must be interpreted
with some care. It must be remembered that the theory deals
only with average values, and that in practice deviations from
such average values may occur. If the number of samples is
sufficiently large, and L is found to differ considerably from 1,
there is good ground for the inference that the data are not
Bernoullian; but if L is near to 1 the Bernoullian hypothesis is
not necessarily established definitely; it should be considered as
being plausible only (cf. P:Jf 4^:215). Furthermore, the formulae
assume that the events are all independent; a value of $\sigma$ less or
greater than $\sigma_B$ may therefore indicate, as an alternative
interpretation, that the events are negatively or positively correlated
(see P:i77:366).

The fact that the practical application of the Lexis theory is
thus based, in reality, merely on the computation of the Lexis
Ratio, $L = \frac{\sigma}{\sigma_B}$, constitutes not only an important step in the
classical theory of sampling, by its elucidation of the effects of
removing the limitations of simple sampling, but also forms a
valuable connecting link between the classical procedures and
the ideas underlying the more recent development of the
"$\chi^2$ test" (for which see Chapter IX). For $L^2$, being $\frac{\sigma^2}{\sigma_B^2}$, where $\sigma^2$
is calculated from the data and $\sigma_B^2$ is computed on the assumption
that p and q are constant, is (as may also be seen easily from the
hypothetical numerical illustration of the three types given at
p. 294; C; 11) simply

$$\frac{\frac{1}{v} \sum_{r=1}^{v} (f_r - np)^2}{npq} \tag{41}$$

where $f_1, f_2, \ldots, f_v$ are the observed occurrences; and this
formula, as will become clear later from Chapter IX, is, in fact,
$\frac{\chi^2}{v}$, under the particular conditions assumed (see p. 217; B; 9).

Modifications Necessary for Dealing with Small Samples

In Section (f) of Chapter IV, and at p. 291 of C; 10, it has
been remarked that when the true p (and q) are not known it may
often be necessary to form some estimate of their values. When
the conditions are "random", and the sample is large so that the
deviation between the true p and the observed statistical
frequency $\frac{s}{n}$ will be relatively small, the considerations already
discussed indicate that the estimate of the true values for the
universe may be obtained, for practical purposes, by adopting
the observed values of the sample. This procedure clearly would
become less of an approximation as n, the size of the sample,
increases (cf. p. 292; C; 10). In the other direction, however, it
will be equally apparent that the sample value will become less
reliable as the sample becomes smaller. The investigations
arising from the latter portion of this fundamental principle
constitute the modern and very important Theory of Small Samples.


The nature of the problem may be seen from an examination
of the estimates, which can be made from the sample, for the
arithmetic mean and the mean square deviation of the universe.
Suppose that a sample numbering n is drawn, with magnitudes
$x_1, \ldots, x_n$; their arithmetic mean, say $\bar{x}$, is then $\bar{x} = \frac{1}{n}\sum_{r=1}^{n} x_r$,
and clearly the principle on which this procedure is based is the
same whether the sample is large or small. In the case, however,
of the mean square deviation, the $\sigma^2$ in the universe is, by
definition, the average of the squares of the deviations from the mean,
so that $\sigma^2 = \frac{1}{R}\sum (x_r - m)^2$, where the $x_r$'s are the values in the
universe, R is the number of values of r over which $\sum$ extends, and m
is the true mean of the universe. When we come to deal with a
limited sample from that universe, for example the observed
values $x_1, x_2, \ldots, x_n$, we could obviously make an estimate of
$\sigma^2$, say $\sigma_e^2$, by calculating $\frac{1}{n}\sum_{r=1}^{n} (x_r - m)^2$, which would
approach more and more closely to the $\sigma^2$ of the universe as the size
of the sample increases. This estimate, however, involves m,
the mean of the universe, which usually will be unknown. If we
write the preceding $\sigma_e^2 = \frac{1}{n}\sum_{r=1}^{n} (x_r - \bar{x} + \bar{x} - m)^2$, where $\bar{x}$ is the
mean of the sample, it will be observed that, since the sum of the
deviations $(x_r - \bar{x})$ is zero and the product term therefore
vanishes, this may be put as
$\sigma_e^2 = \frac{1}{n}\sum_{r=1}^{n} (x_r - \bar{x})^2 + (\bar{x} - m)^2$. The first term here is the mean
square deviation in the sample itself, say $\sigma_s^2$, with reference to
the mean of the sample (cf. p. 299; C; 11), and it is seen to differ
from the estimate $\sigma_e^2$ by the always positive quantity $(\bar{x} - m)^2$, so
that clearly a biassed error would be involved if the mean square
deviation as calculated from the sample alone were to be taken
as an estimate of the true mean square deviation of the universe.
The quantity $(\bar{x} - m)^2$, however, is the square of the deviation of
the mean of the sample from the true mean of the universe, and
by section (e) of Chapter IV its average value may be taken as
$\frac{\sigma^2}{n}$ so long as the assumptions implicit therein are appropriate
(see below). Furthermore $\frac{\sigma^2}{n} \approx \frac{\sigma_e^2}{n}$, since $\sigma_e^2$ is the estimate of
$\sigma^2$. From the original expression this course of reasoning
therefore gives $\sigma_e^2 = \sigma_s^2 + (\bar{x} - m)^2 = \sigma_s^2 + \frac{\sigma^2}{n} \approx \sigma_s^2 + \frac{\sigma_e^2}{n}$,
whence

$$\sigma_e^2 = \left(\frac{n}{n-1}\right)\sigma_s^2 \tag{42}$$

This formula, in which the factor $\left(\frac{n}{n-1}\right)$ is known as Bessel's
Correction, has been in use since the time of Gauss (see p. 164;
A; 8).
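
The sense in which Bessel's Correction holds "on the average" can be shown exactly for a small universe. The sketch below enumerates every possible sample of three drawn independently (that is, with replacement) from a hypothetical universe of four values, and averages the mean square deviation of each sample about its own mean. The universe, the sample size, and the names used are assumptions of the illustration.

```python
from fractions import Fraction as Fr
from itertools import product

def mean_square_deviation(values):
    """Mean square deviation of the values about their own arithmetic mean."""
    m = Fr(sum(values), len(values))
    return sum((x - m) ** 2 for x in values) / len(values)

universe = [1, 2, 4, 7]                       # hypothetical universe
sigma2 = mean_square_deviation(universe)      # true mean square deviation

n = 3                                         # size of each sample
samples = list(product(universe, repeat=n))   # all equally likely samples
avg_sample_msd = sum(mean_square_deviation(s) for s in samples) / len(samples)
corrected = Fr(n, n - 1) * avg_sample_msd     # Bessel's Correction applied

print(sigma2, avg_sample_msd, corrected)
```

The average of the uncorrected sample values is exactly (n-1)/n times the true value, and the corrected average restores the true value. Any single sample may of course still differ considerably from it.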

The actual process by which (42) is obtained does not depend
upon n being small, so that the formula is applicable to large as
well as to small samples. The introduction of the (n - 1) term
in the denominator, however, has an appreciable effect upon the
value only when n is small, and the method is therefore usually
adopted in practice especially for small samples.

It must be realized that the formula is an approximation
merely. It is based essentially upon the substitution, during the
argument, of $\frac{\sigma^2}{n}$ for $(\bar{x} - m)^2$, which is valid only as an average
for a large number of samples, so that the resulting (42) likewise
gives only an average value and may therefore provide an
estimate for any particular sample which may differ markedly from
the true value. The contradictions thus present have been well
analyzed by Steffensen in P:i4f^:12-20, where the corresponding
formulae for the third, fourth, and higher moments are also
discussed (see p. 219; B; 11).

Formula (42), by which an estimate, $\sigma_e^2$, for the $\sigma^2$ of the
universe is obtained from the $\sigma_s^2$ of the sample, is an important
example in a procedure which in general may be called The
Theory of Statistical Estimation. Such estimated values are
sometimes referred to as Presumptive Values (see p. 219; B; 11).
The preceding derivation, moreover, being founded on a mode
of reasoning which, as indicated, cannot be viewed as wholly
satisfactory, serves also to introduce the student to the
realization that this problem of "estimation" is surrounded by certain
disagreements and controversies, accompanied often by
misunderstanding. A brief reference to these difficulties is therefore
essential here.

The Theory of Statistical Estimation ; The Classical Approach — 
The Principle of Insufficient Reason 

The classical method of approach to formula (42) was based
upon the obvious principle that (as stated at p. 263; C; 1) the
drawing of a sample, s, from a population or universe, P, by some
process of selection, S, may be stated symbolically in the form
s = S(P). More completely, suppose that an event, E, can occur
under only one of the mutually exclusive conditions $F_1, F_2, \ldots, F_n$,
and that it has been observed to happen s times in a sample of n
trials. Since here, in this problem of estimation, we are given
merely the observed set of s occurrences, and wish to form an
estimate with respect to the parent population from which they
arose, let $\kappa_r$ be the a priori probability that $F_r$ exists, and $x_r$ the
a priori probability of producing the event E from $F_r$, where r
takes the values 1, 2, . . . , n. Then the reasoning underlying
the Bayes-Laplace Theorem (see p. 221; B; 12) shows that the
total probability that one of the conditions $F_1, F_2, \ldots, F_n$ exists,
and that E will happen s times in n, is

$$\sum_{r=1}^{n} \kappa_r \binom{n}{s} x_r^{s} (1 - x_r)^{n-s} \tag{43}$$

The purpose of the Bayes-Laplace Theorem itself, as stated
in formula (43a) on p. 222, is to deduce the probability, a
posteriori, that the particular condition $F_r$ (say) was the origin of the
set of occurrences observed. That, however, is not exactly the
problem here. We are now seeking rather an estimate as to
which of the various possible conditions is most likely to have
produced the set of occurrences observed (namely, the happening
of E actually s times in n). Under those circumstances the "best"
estimate clearly must be that which will give the greatest
possible value, a priori, to the probability, (43), of the occurrences
actually observed. [It will assist the student to note that the
"model" of the Bayes-Laplace Theorem is used in this
formulation of the problem of estimation; we do not, however, need to
go beyond the preliminary formula (43), so that formulae (43a)
and (43b) on p. 222 are here not actually required.]

In order to deduce by this means an estimate for $\sigma^2$ the
classical method is to consider $n$ observed values $x_1, x_2, \ldots, x_n$
of a fixed true value $X$, under the conditions of the Normal Curve
(11), which, it is to be remembered, is a very close representation
of the point binomial involved in (43). The a priori probability
that all the deviations $(x_r - X)$, where $r = 1, 2, \ldots, n$, will
occur together is then

$$\left(\frac{1}{c\sqrt{\pi}}\right)^{n} e^{-\frac{\sum (x_r - X)^2}{c^2}}.$$

This may be written

$$\left(\frac{1}{c\sqrt{\pi}}\right)^{n} e^{-\frac{\sum (x_r - \bar{x})^2}{c^2}}\, e^{-\frac{n(\bar{x} - X)^2}{c^2}} = f(x_r, X) \text{ say, where } \bar{x} = \frac{1}{n} \sum x_r.$$

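But $X$, being unknown, might lie anywhere between $-\infty$ and
$+\infty$; and if we accept the Principle of Insufficient Reason (see
p. 181; (i) of B; 1, and p. 222; B; 12), which amounts to assuming
that all the a priori existence probabilities, $\kappa_r$, are to be taken as
equal (that is, as a constant, $\kappa$ say), it follows that the total
probability of the observed deviations $(x_r - X)$ is

$$\kappa \int_{-\infty}^{+\infty} f(x_r, X)\, dX.$$

To find this integral we have to evaluate
$\int_{-\infty}^{+\infty} e^{-\frac{n(\bar{x} - X)^2}{c^2}}\, dX$;
putting $u = \sqrt{n}\,(X - \bar{x})$, so that $dX = \frac{du}{\sqrt{n}}$, this is

$$\frac{1}{\sqrt{n}} \int_{-\infty}^{+\infty} e^{-\frac{u^2}{c^2}}\, du = \frac{c\sqrt{\pi}}{\sqrt{n}}$$

by (a) at p. 209; B; 7; and consequently

$$\kappa \int_{-\infty}^{+\infty} f(x_r, X)\, dX = \frac{K}{c^{\,n-1}}\, e^{-\frac{\sum (x_r - \bar{x})^2}{c^2}},$$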

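where $K$ is written for the constant. To determine the value
of $c$ for which this probability will be a maximum we take logarithms,
differentiate with respect to $c$, equate to zero, and obtain

$$\frac{n-1}{c} - 2c^{-3} \sum (x_r - \bar{x})^2 = 0, \text{ whence } c^2 = \frac{2 \sum (x_r - \bar{x})^2}{n-1},$$

which, since $c^2 = 2\sigma^2$, is $\sigma^2 = \frac{\sum (x_r - \bar{x})^2}{n-1} = \left(\frac{n}{n-1}\right) \sigma_s^2$ as before.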
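The maximization just described can be checked numerically. The following is only a sketch under stated assumptions: the function names and the sample values are hypothetical, the constant $K$ is dropped because it does not affect the position of the maximum, and the relation $c^2 = 2\sigma^2$ is assumed for the modulus of the Normal Curve.

```python
import math

def log_marginal(c, sum_sq, n):
    """Log of c^-(n-1) * exp(-sum_sq / c^2), omitting the constant K."""
    return -(n - 1) * math.log(c) - sum_sq / c**2

def classical_estimates(xs):
    """Return (c^2, sigma^2) at which log_marginal is greatest for xs."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    c2 = 2.0 * sum_sq / (n - 1)   # root of the derivative with respect to c
    return c2, c2 / 2.0           # assumes c^2 = 2 * sigma^2

# Hypothetical observations, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
c2, sigma2 = classical_estimates(sample)
```

Evaluating `log_marginal` slightly to either side of $c = \sqrt{c^2}$ gives a smaller value, which is what the vanishing derivative implies.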


The “Best Unbiassed Estimate” 

The difficulties resulting from the Principle of Insufficient
Reason led Gauss originally (H:f7:49), and later the Russian
Markoff (H Jiff), to the concept of an unbiassed estimate. If we
wish to “estimate” the value of a parameter $\theta$, and for that
purpose have at our disposal the values $x_1, x_2, \ldots, x_n$, so that any
estimate of $\theta$ will be some function $F(x_1, x_2, \ldots, x_n)$ of those
values, then $F(x_1, x_2, \ldots, x_n)$ is called an “unbiassed estimate”
of $\theta$ if the mathematical expectation of $F(x_1, x_2, \ldots, x_n)$ is
identically equal to $\theta$. Since there are many such functions with
expectations equal to $\theta$, the best unbiassed estimate is taken as
the one for which the “variance” (see p. 163; A; 6) is a minimum.
This procedure is, of course, once more based on the use of a
dogmatic “principle”. Markoff, whose work has recently been
examined in English by Neyman (P:ff;^:130, and P:ffS:105), has
dealt with those “best unbiassed estimates” which are linear
functions of $x_1, x_2, \ldots, x_n$. By this method the estimate $\sigma_e^2$ is
again found to be $\left(\frac{n}{n-1}\right) \sigma_s^2$.


The Method of Maximum Likelihood 

The introduction of the Principle of Insufficient Reason into 
the demonstration on p. 38 involves, of course, all the difficulties 
associated with the acceptance of that principle (see also P:9£: 
128-130, and P:;?S;146 and 151). Certainly in many instances it 
must be very difficult to justify the supposition that the unknown 
X might lie anywhere between the most extreme possible limits 
stated, namely $-\infty$ and $+\infty$; and even if it could, it may be an
even more sweeping assumption to suppose that it is equally 
likely to fall in any particular place in so extensive a range. There 
is consequently much to be said in favour of discarding that pro- 
cedure — particularly as the problem of estimation can be ap- 
proached logically, without the necessity of introducing the 
principle at all, by using the more direct Method of Maxi- 
mum Likelihood. By this method the reasoning is based solely 
upon the information afforded by the actual observations, and 
no assumptions are made concerning a priori knowledge. In the
case of the $n$ observations previously considered the argument is
simply that $\left(\frac{1}{c\sqrt{\pi}}\right)^{n} e^{-\frac{\sum (x_r - X)^2}{c^2}}$, as there given, represents the
a priori probability that the set of observed values will occur, and
that the best estimates for $X$ and $c$ simultaneously (since both $X$
and $c$ are unknown) will be those which make this probability a
maximum. Taking logarithms, then the partial differential coefficient
with respect to $X$, and equating to zero, we find immediately
that $\sum (x_r - X) = 0$, whence $X = \frac{\sum x_r}{n} = \bar{x}$; and similarly
differentiating with respect to $c$ it follows that

$$\frac{c^2}{2} = \sigma^2 = \frac{\sum (x_r - \bar{x})^2}{n} = \sigma_s^2.$$

The estimate thus reached by the Method of Maximum Likelihood,
being $\sigma_s^2$, is not the same as the $\left(\frac{n}{n-1}\right) \sigma_s^2$ obtained by the
other methods. The distinction is important. It will be seen
that in the classical method involving the Principle of Insufficient
Reason the unknown $X$ is dealt with by being supposed to lie
anywhere, with equal probability, between $-\infty$ and $+\infty$, and
then $\sigma^2$ is estimated by maximizing the resulting probability;
in the Method of Maximum Likelihood no such assumption is
made with respect to $X$, but both $X$ and $\sigma^2$ simultaneously are
estimated by maximizing the probability of the observed event
without any assumption of a priori ignorance. The polemics 
which have been precipitated by these rival methods already 
have produced a literature far too extensive for detailed reference 
here. For the present purpose, however, it may be sufficient to 
repeat Neyman’s remark (P:P;^:135) that the question as to 
which is preferable is, in reality, “one of taste only”. 
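The difference between the two estimates can be seen in a short numerical sketch. It assumes the Normal Curve with modulus $c$ (so that $c^2 = 2\sigma^2$); the function names and the data are hypothetical and serve only to illustrate the joint maximization over $X$ and $c$.

```python
import math

def log_likelihood(X, c, xs):
    """Log of (1 / (c * sqrt(pi)))^n * exp(-sum((x - X)^2) / c^2)."""
    n = len(xs)
    spread = sum((x - X) ** 2 for x in xs)
    return -n * math.log(c * math.sqrt(math.pi)) - spread / c**2

def ml_estimates(xs):
    """Return (X, c^2, sigma^2) at the joint maximum of log_likelihood."""
    n = len(xs)
    X = sum(xs) / n                              # from sum(x_r - X) = 0
    sigma2 = sum((x - X) ** 2 for x in xs) / n   # sigma_s^2, divisor n
    return X, 2.0 * sigma2, sigma2

# Hypothetical observations, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
X_hat, c2_hat, sigma2_hat = ml_estimates(sample)

# The estimate reached by the other methods carries the factor n / (n - 1).
corrected = sigma2_hat * len(sample) / (len(sample) - 1)
```

For a sample of 8 the two estimates differ by the factor 8/7, and the gap narrows as $n$ grows.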

Sampling Distributions

The problems of sampling follow from the idea of drawing a 
random sample from a parent universe.* If a group of n persons 
has been so derived, that group would have, with respect to any
particular characteristic (such as age, height, weight, etc.), its
own sample mean, its own sample standard deviation, and likewise 
its own sample value of any other statistical parameter which may 
be selected for determination. Another group of $n$, similarly
drawn at random, would again have its own sample mean, 
standard deviation, etc. The process of drawing different sample 
groups from the parent universe could thus be continued until 
from a series of samples we should have found a series — a 
“distribution” — of sample means, a distribution of standard 
deviations, and distributions of other measurable characteristics 
— for the means of the various samples would not all be identical, 
the standard deviations of the samples would differ from each 
other, and the values of any other measurable characteristic would 
vary from sample to sample. The study of the forms taken by 
these sampling distributions constitutes an important section 
of the theories of both large and small samples. 

It has already been shown in formula (30) that the standard
deviation of the arithmetic mean of $n$ independent determinations
of a single quantity is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard deviation
of a single observation. Viewing this result from the
sampling standpoint here under consideration, the relation
$\sigma[\text{A.M.}] = \frac{\sigma}{\sqrt{n}}$ can evidently be interpreted as giving the standard
deviation of the arithmetic mean based on $n$ samples, when
the standard deviation of the parent universe is known to be $\sigma$.
If, however, the universe $\sigma$ is not known, the formula can be
written $\sigma[\text{A.M.}] \approx \frac{\sigma_s}{\sqrt{n}}$ if $\sigma_s$ can be taken as an estimate of $\sigma$, or
as $\sigma[\text{A.M.}] = \frac{\sigma_s}{\sqrt{n-1}}$ by substituting from (42) when Bessel’s
correction is required. We thus reach an important formula and
method for the practical determination of the standard deviation
of the arithmetic mean; for if we have a sample from which $\sigma_s$
can be evaluated, and if there are $n$ observations altogether, then
$\frac{\sigma_s}{\sqrt{n}}$, or $\frac{\sigma_s}{\sqrt{n-1}}$ for small samples, will measure the standard
deviation of the arithmetic mean of such a sample, i.e., it will give
the standard deviation of the distribution of sample means. An
algebraical proof may be found in P:S;?:189.

The term standard error is often applied to the standard 
deviation of such a sampling distribution. A numerical appli- 
cation is shown on p. 300; C; 12. 
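As a small illustration of the two working formulae for the standard error of the mean, the sketch below computes $\sigma_s/\sqrt{n}$ and, with Bessel’s correction, $\sigma_s/\sqrt{n-1}$. The function name, the `bessel` flag, and the data are assumptions made for this example only.

```python
import math

def standard_error_of_mean(xs, bessel=False):
    """Standard error of the mean from the sample alone.

    sigma_s is the sample standard deviation with divisor n; with
    bessel=True the result is sigma_s / sqrt(n - 1), which is the same
    as sigma_e / sqrt(n) with sigma_e^2 = (n / (n - 1)) * sigma_s^2.
    """
    n = len(xs)
    mean = sum(xs) / n
    sigma_s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sigma_s / math.sqrt(n - 1 if bessel else n)

# Hypothetical observations, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se_large = standard_error_of_mean(sample)               # sigma_s / sqrt(n)
se_small = standard_error_of_mean(sample, bessel=True)  # sigma_s / sqrt(n - 1)
```

The corrected value is always the larger of the two, and the difference matters chiefly when $n$ is small.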

The general problem of determining the standard errors of 
parameters (of which the formula just given for the standard 
deviation of the mean is a simple case) can be presented most 
easily by the method used by Karl Pearson in H \93j of which the 
principles are reproduced conveniently in P:5;^:187-191 and 
P:1 77:394-411. The objectives of this study hardly seem to 
require the inclusion of the extensive algebra necessary for the 
derivation of the various formulae there shown. It may accord- 
ingly suffice if we here state the following results, where n denotes 
the number of individuals (variates) in the sample, and the 
parent universe is assumed to be normal with standard deviation
$\sigma$:

Parameter                                   Standard Error

(i)   Arithmetic Mean                       $\frac{\sigma}{\sqrt{n}}$
(ii)  Median (see P:i55:199                 $\frac{1.2533\,\sigma}{\sqrt{n}}$
      and P:/7^:134)
(iii) Standard Deviation                    $\frac{\sigma}{\sqrt{2n}}$
(iv)  Mean Square Deviation (Variance)      $\sigma^2 \sqrt{\frac{2}{n}}$
(v)   $q$th Moment about a Fixed Point      $\sqrt{\frac{\mu'_{2q} - \mu_q'^{\,2}}{n}}$
(vi)  $q$th Moment about the Mean           $\sqrt{\frac{\mu_{2q} - \mu_q^2 + q^2 \mu_2 \mu_{q-1}^2 - 2q\,\mu_{q-1} \mu_{q+1}}{n}}$

These formulae provide a method for determining the limits
within which a sample value of a parameter will probably lie.
When the assumption of normality is admissible, it is customary
in practice to use $\pm 3$ times the standard error as defining those
limits, in conformity with the conclusions of Chapter III, and as
illustrated for the case of the arithmetic mean on p. 300; C; 12.

The following points may also be noted: (a) The standard
error of the arithmetic mean was obtained without reference to
the form of the distribution; (b) When the parent universe is not
normal, the standard error of the standard deviation should be
taken as $\frac{\sigma}{\sqrt{2n}} \left(1 + \frac{\beta_2 - 3}{2}\right)^{1/2}$, where $\beta_2 = \frac{\mu_4}{\mu_2^2}$, which may differ
considerably from the value $\frac{\sigma}{\sqrt{2n}}$ based on the assumption of
normality; (c) In comparing the formulae for the standard errors
of the $q$th moment about a fixed point and the mean, respectively,
it should be remembered that the mean of the population
is a fixed point, so that the standard error in the $q$th moment
about the mean of the population can be set down from (v) at once
by dropping the primes (see p. 254; B; 27) and writing $\sqrt{\frac{\mu_{2q} - \mu_q^2}{n}}$,
whereas formula (vi) relating to the mean refers to the mean of
the sample.
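The standard errors of the arithmetic mean and of the median for a normal universe can be checked by simulating the sampling distributions directly. The sketch below is illustrative only: the sample size, the number of trials, the seed, and the function name are all assumptions, and the agreement with $\sigma/\sqrt{n}$ and $1.2533\,\sigma/\sqrt{n}$ is approximate because both the simulation and the large-sample formulae carry some error.

```python
import math
import random
import statistics

def simulated_standard_errors(n, trials, sigma=1.0, seed=0):
    """Draw `trials` normal samples of size n; return the standard
    deviations of the resulting sample means and sample medians."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pstdev(means), statistics.pstdev(medians)

n, trials = 101, 2000                      # hypothetical settings
se_mean, se_median = simulated_standard_errors(n, trials)
theory_mean = 1.0 / math.sqrt(n)           # sigma / sqrt(n)
theory_median = 1.2533 / math.sqrt(n)      # 1.2533 * sigma / sqrt(n)
```

The simulated values should lie within a few per cent of the theoretical ones, and the median should show the larger standard error of the two.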

“Student’s” Distribution 


The distinction which has already been emphasized in respect
of small samples between (a) the standard deviation, $\sigma$, of the
parent universe, (b) the standard deviation, $\sigma_s$, of the sample,
and (c) the estimate, $\sigma_e$, of the standard deviation of the universe
which can be made from $\sigma_s$ by formula (42), indicates immediately
the importance of considering the distribution of the
standard deviation of a statistical parameter as calculated from
the data of the sample only. For large samples the means from
a normal population are distributed normally, with standard
error $\frac{\sigma}{\sqrt{n}}$ as on p. 42; and the ratio
$\frac{\text{Deviation of Mean}}{\text{Standard Error}}$, by
which is meant the deviation of the sample mean from the mean
of the universe, divided by the standard error of the mean, or
$\frac{\bar{x} - m}{\sigma / \sqrt{n}}$, is likewise distributed normally, and $\frac{\bar{x} - m}{\sigma_s / \sqrt{n}}$ nearly so.

For small samples, however, the ratio $\frac{\bar{x} - m}{\sigma_s / \sqrt{n}}$ does not follow a
normal distribution. The exact form was discovered in 1908 by
“Student” (p. 165; A; 10), who used $\frac{\bar{x} - m}{\sigma_s} = z$ and found (see
p. 226; B; 13) for the probability of a value of $z$ between $z$ and
$z + dz$, or $G(z)$, the expression

$$G(z) = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{n-1}{2}\right)} (1 + z^2)^{-\frac{n}{2}}\, dz \tag{44}$$

which is known as “Student’s” Distribution. This curve is
symmetrical in $z$, and more sharply peaked than the normal
curve; as $n$ increases it becomes nearly normal in the centre, so
that a normal curve with standard deviation of $\frac{1}{\sqrt{n-3}}$ provides
an excellent approximation (P:;^^:140).


The important characteristic of this formula is its dependence
on only one constant, $m$, of the parent universe; $\sigma$ is not involved.
Since it gives the probability, in a sample of $n$ when $m$ is specified,
of getting a value of $\frac{\bar{x} - m}{\sigma_s} = z$ lying between $z$ and $z + dz$, it will
be evident that the probability in a sample of $n$ of finding the
ratio $\frac{\bar{x} - m}{\sigma_s}$ in absolute value as great as or greater than that arising
from a stated $\bar{x} - m$ and the observed $\sigma_s$ is

$$P = 2 \int_{|z|}^{\infty} G(z) \tag{45}$$

“Student’s” discovery thus provided a technique for examining,
particularly for small samples, whether a value of $\frac{\bar{x} - m}{\sigma_s}$ actually
observed in a given sample, when $m$ is a specified value, is
unusually large or small. A small value of $P$ corresponds with a
large value of $z$; and a value of $P$, such as .05, for example, means
that only 5 times in 100 trials should we obtain for the ratio
$\frac{\bar{x} - m}{\sigma_s}$ a value as large as, or larger than, that actually observed.
The inferences which may, or must not, be drawn from such a
statement are considered later.


If we now write $\sigma_e^2$ as the estimate, $\left(\frac{n}{n-1}\right) \sigma_s^2$, of $\sigma^2$ in
accordance with (42), and define $t$ as $\frac{\bar{x} - m}{\sigma_e / \sqrt{n}}$, so that

$$t = \frac{\bar{x} - m}{\sigma_s / \sqrt{n-1}} = z \sqrt{n-1} \tag{46}$$

and $n - 1$ is the number of degrees of freedom of $t$ (since one
degree is absorbed in determining $\bar{x}$ from the data, as explained
in Chapter VI, p. 55), it follows from (44) that the distribution
of $t$, say $G(t)$, is given by

$$G(t) = \frac{\Gamma\left(\frac{d+1}{2}\right)}{\sqrt{\pi d}\; \Gamma\left(\frac{d}{2}\right)} \left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}} dt \tag{47}$$

where, as in Chapter IX, $d$ is the number of “degrees of freedom”.

All the expressions (44), (46), and (47) will be found in the
literature. For purposes of identification, (44) is often referred
to as “Student’s z-distribution”, while (46) and (47) are usually
called “Student’s t-distribution” or simply “the t-distribution”.

Tables of values of the integral of $G(t)$ have been given by
“Student” (see P:f 77:536), in P:P7, and by R. A. Fisher (P:4S:
177), in several different forms. A very convenient nomograph
for the calculation of $P$, in (45), devised by Nekrassoff (11:168),
is reproduced in P:;^P:136 (and 115).
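Where tables or a nomograph are not at hand, the probability $P$ can be approximated by numerical integration. The sketch below assumes the usual form of the t-distribution with $d$ degrees of freedom and the relation $t = z\sqrt{n-1}$; the function names, the use of Simpson’s rule, and the numerical inputs are illustrative choices rather than anything prescribed in the text.

```python
import math

def t_density(t, d):
    """Ordinate of the t-distribution with d degrees of freedom."""
    coeff = math.gamma((d + 1) / 2) / (math.sqrt(math.pi * d) * math.gamma(d / 2))
    return coeff * (1 + t * t / d) ** (-(d + 1) / 2)

def two_sided_p(t, d, steps=2000):
    """P = 1 - 2 * (integral of the density from 0 to |t|), Simpson's rule.

    `steps` must be even.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, d) + t_density(upper, d)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, d)
    return 1.0 - 2.0 * total * h / 3.0

def z_to_t(z, n):
    """Convert the z ratio for a sample of n into t = z * sqrt(n - 1)."""
    return z * math.sqrt(n - 1)

# Hypothetical case: z = 0.754 observed in a sample of n = 10 (d = 9).
p_value = two_sided_p(z_to_t(0.754, 10), 9)
```

With these inputs $t$ is about 2.262 on 9 degrees of freedom, which lies close to the conventional 5% point, so `p_value` comes out near .05.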

The Assumptions and Meaning of the "Student" Method 

The publication of “Student’s” formula had far-reaching
consequences. It seemed to release investigators from the necessity
of following the earlier methods, which usually, in tests of
significance, had simply taken the observed $\sigma_s$ as an estimate of
$\sigma$ and then assumed a normal distribution (cf. H:f 55:101, and
P:f55:78). This very fact, however, that the “Student” formula
avoids consideration of the parent $\sigma$ altogether, by involving only
$\sigma_s$ directly without any apparent necessity for contemplating
what $\sigma$ itself may be, will show immediately that the formula
must be applied with careful regard for its underlying assumptions
and for the precise hypothesis which it is employed to
analyze; for obviously it may lead to unjustifiable inferences if
the observed $\sigma_s$ is not a reasonable estimate of $\sigma$. This important
point at times has given rise to much confusion; it may therefore
be well here to devote some space to the meaning of the “Student”
method.

Firstly, it must be emphasized (as will be seen from the proof
at p. 226; B; 13) that the whole “Student” theory is based on the
assumption that the distribution of the parent population is
“normal”. The method is therefore clearly applicable when a
sample has been drawn from a universe which is known, from
prior knowledge, to be approximately normal (as, for example,
a distribution of the heights of men). If, however, we do not
know whether a sample has been drawn from a normal or from
a non-normal universe, it must be remembered that the “Student”
theory, as well as the classical “normal” test, may then
be open to question if the basic assumption of a normal parent
population should be markedly invalid (cf. P:i^5:155-8,
174, and P:7).

Secondly, the method supposes (see again p. 223; B; 13) that
the observations comprising the sample under scrutiny have been
obtained by random sampling. The prime importance of this
condition has frequently been overlooked. If an investigator is
quite satisfied that a single sample which he has obtained has
really been secured by a random process of selection, he is clearly
in a position to apply the “Student” theory (so long, of course,
as he also may assume that the universe is normal). If, however,
he is not assured of this essential randomness, he obviously dare
not apply the “Student” method to his single sample; he must
then wait until further samples convince him that randomness
exists; but when that point is reached he will, in fact, usually
have enough data to determine the parent $\sigma$ within close limits;
and when he can do that he will be able to make valid tests by
using the “normal” integral based on large samples rather than
the “Student” small-sample theory.

Thirdly, it must not be forgotten that, in any random sampling
or probability procedure, it is always possible that some most
unlikely event may actually occur. If that should happen in a
single random sample which an investigator has obtained, it
obviously will be dangerous for him to draw any inference whatever
beyond a realization that something unusual may have
occurred. For this reason it will be clear, on a little consideration,
that with a single sample the “Student” method may
suggest a misleading inference unless all the elements inherent
in the test are not unusual.

In order to see what this statement really means, we may
recall that in (44) the “Student” theory sets out, on the assumptions
of a normal universe and random sampling therefrom, to
assign a probability to the occurrence in a single sample of the
observed value $\frac{\bar{x} - m}{\sigma_s}$ when $m$, the population mean, is specified.
Nothing is here said about what value $\sigma_s$ should have; it might,
indeed, be thought at first sight that since $\sigma_s$ can take any positive
value whatsoever (as may be confirmed from the fact that
in the proof at p. 226; B; 13 the integration is performed over all
values of $\sigma_s$ from 0 to $\infty$), therefore a valid inference with respect
to the parent population may be deduced regardless of what $\sigma_s$
may be. The previous statement, however, that a misleading
inference may be drawn unless all the elements inherent in the
test are not unusual here serves to place on $\sigma_s$ the requirement
that it must be not unusual. For if it should be unusual, i.e., not
representative of the $\sigma$ of the parent population from which it
came, it will be quite clear that it may lead to an unusual inference,
i.e., to an inference which will not be typical of the population
which is being tested on the evidence afforded by the one
sample and its $\sigma_s$. The proper inference then would be merely
that an unusual event (here the appearance of an unusual $\sigma_s$) had
actually occurred.

As a further illustration of the importance of thus being satisfied
that $\sigma_s$ is not unusual, it should be remembered also that we
are dealing with a ratio, of $\bar{x} - m$ to $\sigma_s$, in which both $\bar{x}$ and $\sigma_s$
vary from sample to sample. Such a ratio may be small even
when $\bar{x} - m$ is large, if at the same time $\sigma_s$ happens to be large
enough. The mere fact, therefore, that in any particular case
$\frac{\bar{x} - m}{\sigma_s}$ is not unusual may not justify the assertion that $\bar{x} - m$ is
not unusual unless also we are able to say that $\sigma_s$ is not unusual.

Valuable additional remarks on these aspects of the “Student”
theory may be found in P:^P:135, 137, 139, and 141, and
P:f5(?:59. Graphic illustrations are there given which may be
of help in grasping the importance of the statement that $\sigma_s$ must
be not unusual, and in realizing the kind of inference which may
properly be drawn.

The Applicability of the “Student” Theory

The “Student” integral has been misinterpreted often in the
literature, through failure to realize these limitations: namely,
the requirements of a normal population (or, in practice, of a
population which at any rate is not markedly non-normal),
random sampling, and a not unusual $\sigma_s$. The manner in which
at first glance it seems to avoid the earlier necessity of using the
normal theory, with $\sigma_s$ taken as an estimate of $\sigma$, is sometimes
more apparent than real. Indeed, as observed by Deming
(P:iS0:59), the numerical refinement of making probability calculations
by the “Student” integral, rather than with the normal
integral in which $\sigma_s$ is substituted for $\sigma$, “is not so momentous as
has been proclaimed by many writers; of much more importance
to the statistician is the fact that, whether he uses the classical
estimate ... of $\sigma$, or ‘Student’s’ integral, he is at the mercy of
the sampling fluctuations of $\sigma_s$ even in controlled experiments”.
Under any circumstances, therefore, the “Student” method must
be applied, and then interpreted, with care.

The most direct manner of applying “Student’s” distribution,
either in his $z$ form of (44) and (45), or in the $t$ form of (46) or
(47), is to test a hypothesized mean, $m$, on the evidence afforded
by the value of $\frac{\bar{x} - m}{\sigma_s}$ observed in a single sample of $n$ variates,
subject always to the conditions already stated concerning normality,
randomness, and $\sigma_s$. Illustrations are given in examples
(1), (2), and 3(a) at p. 301; C; 13.

Extension of the “Student” Method to Testing the Difference
between the Means of Two Samples

The same procedure may also be applied immediately, with 
similar reservations, to test the difference between the means of two 
samples, when either (i) both samples merely exhibit the effects 
of two different actions upon the same set of n individuals (as in 
example 3(b) at p. 304; C; 13), or (ii) the two samples exhibit the
effects of two different actions upon two different sets, each of n 
individuals (as in example (4) at p. 305; C; 13). 

The latter use of two different sets, however, will obviously
decrease the reliability of the results in comparison with method
(i) which uses the same set, for there may be variations between
the two sets even though both are assumed to have been drawn
at random from a normal universe. For this reason, and also
to provide a method of dealing with the problem when the number,
$n$, is not the same in the two sets, R. A. Fisher (P:39) showed
that “Student’s” distribution can be extended, in order to test
the difference between the means of two samples of different size,
by treating the two different sets as two entirely separate series
(cf. P:45:128-9, 130 (example 20), and 133 (second method); and
P:1 10:586). Thus if $\bar{x}_1$ and ${}_1\sigma_s$ be the mean and standard deviation
of the first sample of $n_1$ variates $x'_r$ for $r = 1, 2, \ldots, n_1$, and
$\bar{x}_2$ and ${}_2\sigma_s$ those of the second sample of $n_2$ variates $x''_r$ for
$r = 1, 2, \ldots, n_2$ (both samples being random drawings from a normal
universe with mean $m$ and standard deviation $\sigma$), it follows from
(27) and (30), on the assumption that the two samples are separate
(i.e., independent), that the mean square error (or “variance”)
of the difference between the two means is

$$\sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$

Since, however, $\sigma^2$ is unknown, an estimate of it, $\sigma_e^2$, must be made
from the data; and it will be seen, on the principles of (42) as
extended at p. 164; A; 8 and p. 250; B; 26 for the case of $k$ “constraints”,
that we may write

$$\sigma_e^2 = \frac{\sum_{r=1}^{n_1} (x'_r - \bar{x}_1)^2 + \sum_{r=1}^{n_2} (x''_r - \bar{x}_2)^2}{n_1 + n_2 - 2}$$

because $k$ is to be taken as 2 in the denominator on account of
the calculation being made from the two values, $\bar{x}_1$ and $\bar{x}_2$, which
are determined from the two separate series of the data. Evidently
this estimate $\sigma_e^2$ will be $\frac{n_1\, {}_1\sigma_s^2 + n_2\, {}_2\sigma_s^2}{n_1 + n_2 - 2}$; and substituting
this value for $\sigma^2$ in $\sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)$, we obtain $\sigma_e^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)$ as
the estimate of the variance of $\bar{x}_1 - \bar{x}_2$. If now, analogously to
the definition of $t$ in (46) as the ratio of $\bar{x} - m$ to the estimated
standard error of $\bar{x}$, we define $t$ as the ratio of $\bar{x}_1 - \bar{x}_2$ to the
estimated standard error of $\bar{x}_1 - \bar{x}_2$, we have*

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_e \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},$$

where $\sigma_e$ is found as already shown, and the table of the “Student”
$t$-distribution is then entered by taking $d$ as the $n_1 + n_2 - 2$ “degrees
of freedom”. An example is shown in (5) at p. 305; C; 13.
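The calculation for two separate samples of unequal size may be sketched as follows. The function name and the two small data sets are hypothetical; the sketch simply pools the two sums of squared deviations over $n_1 + n_2 - 2$ and forms the ratio of the difference of the means to its estimated standard error.

```python
import math

def two_sample_t(first, second):
    """Return (t, d) for two independent samples, pooling the variance.

    sigma_e^2 = (sum of squared deviations in both samples) / (n1 + n2 - 2),
    and t = (mean1 - mean2) / sqrt(sigma_e^2 * (1 / n1 + 1 / n2)).
    """
    n1, n2 = len(first), len(second)
    mean1 = sum(first) / n1
    mean2 = sum(second) / n2
    ss1 = sum((x - mean1) ** 2 for x in first)
    ss2 = sum((x - mean2) ** 2 for x in second)
    d = n1 + n2 - 2                        # degrees of freedom
    sigma_e2 = (ss1 + ss2) / d             # pooled estimate of sigma^2
    std_err = math.sqrt(sigma_e2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, d

# Hypothetical samples of unequal size, for illustration only.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
t_value, dof = two_sample_t(first, second)
```

The returned $t$ is then referred to the t-distribution with `dof` degrees of freedom, subject to the same reservations about normality, randomness, and a not unusual variance.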

Fisher’s z-Distribution 

The preceding method of testing the difference between the
means of two samples uses the two sample variances, ${}_1\sigma_s^2$ and ${}_2\sigma_s^2$,
in obtaining $\sigma_e^2$ as an estimate of the $\sigma^2$ of the universe, but it
affords no test of the question as to whether both those sample
variances can justifiably be used to give estimates of one and
the same $\sigma^2$. Particularly in the light of the requirement, previously
stated, that $\sigma_s$ in the “Student” theory must be not
unusual, it is therefore important to develop a test of significance
for the difference between two sample variances. Such a test, moreover,
strictly should be applied before embarking upon the above
test of the two means, for the assumptions underlying the test
of the means would not be satisfied if the hypothesis that the two
sample variances can be used for estimates of the same $\sigma^2$ should
be refuted.

Suppose, therefore, that we have two independent estimates
of $\sigma^2$, namely, ${}_1\sigma_e^2$ based as before on a first sample of $n_1$ variates
$x'_r$, and ${}_2\sigma_e^2$ based on a second sample of $n_2$ variates $x''_r$. Then, by
(42),

$${}_1\sigma_e^2 = \frac{\sum_{r=1}^{n_1} (x'_r - \bar{x}_1)^2}{n_1 - 1} \quad \text{and} \quad {}_2\sigma_e^2 = \frac{\sum_{r=1}^{n_2} (x''_r - \bar{x}_2)^2}{n_2 - 1}.$$

In order to surmount the mathematical difficulties in finding the
distribution of the difference ${}_1\sigma_e^2 - {}_2\sigma_e^2$, R. A. Fisher (P:39; cf. also P:^;^:287)
based his approach on the ratio $\frac{{}_1\sigma_e}{{}_2\sigma_e}$. The distribution of this
ratio follows easily (see (4) at p. 228; B; 13) from the distribution
of $\sigma_s$ found in (2) at p. 225; B; 13. Putting then $z = \log_e \left(\frac{{}_1\sigma_e}{{}_2\sigma_e}\right)$
and writing $n_1 - 1 = d_1$ “degrees of freedom” and $n_2 - 1 = d_2$ “degrees
of freedom”, it may be shown readily (as in (5) at p. 229;
B; 13) that, if the two samples come from the same normal
universe, the distribution of $z$, say $F(z)$, is

$$F(z) = \frac{2\, d_1^{d_1/2}\, d_2^{d_2/2}}{B\left(\frac{d_1}{2}, \frac{d_2}{2}\right)} \cdot \frac{e^{d_1 z}}{\left(d_1 e^{2z} + d_2\right)^{\frac{d_1 + d_2}{2}}}\, dz \tag{47a}$$

where the $B$-function is defined as at p. 259; B; 29.

This result is known as Fisher’s z-Distribution. [This $z$, of
course, must not be confused with the $z$ used by “Student” in his
distribution (44)]. It is, in fact, a very general expression, from
which the normal, $\chi^2$ (see Chapter IX), and “Student’s” distributions,
and others also, are obtainable easily as special cases
(P:55:807; or see P:iii:173, or P:i^0:605).

Since (47a) is obtained on the assumptions of random sampling
and a normal universe, and as again it does not involve the
$\sigma$ of the universe, its interpretation (though not its derivation)
is also subject to the reservation that ${}_1\sigma_e$ and ${}_2\sigma_e$ must be not
unusual, just as in “Student’s” distribution it was explained that
$\sigma_s$ must be not unusual.


In practical applications of the method the customary procedure
is to test the hypothesis that ${}_1\sigma_e^2$ and ${}_2\sigma_e^2$ are estimates of
the same universe $\sigma^2$. If that hypothesis is true, the ratio of
${}_1\sigma_e^2$ to ${}_2\sigma_e^2$ will fluctuate around unity from one experiment to
another, and if ${}_1\sigma_e^2$ and ${}_2\sigma_e^2$ agree in giving the same estimate of
$\sigma^2$ their ratio would be 1 and $z$ would be 0. For certain values
of the degrees of freedom, $d_1$ and $d_2$ (where $d_1$ is used with the
greater estimated variance), the values of $z$ corresponding to the
probabilities .05, .01, and .001 (the “5%, 1%, and .1% points”)
have been tabulated by Fisher and Deming in P:4S:250-255, and
have been reproduced (with permission) in many texts.

An illustration is shown in (6) at p. 306; C; 13. 
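The quantity $z$ to be looked up in such tables is easily computed from two samples. In the sketch below the function name and the data are hypothetical; each variance is estimated with divisor $n - 1$, the greater estimate is placed in the numerator so that $d_1$ goes with it, and $z$ is taken as half the natural logarithm of the ratio of the two estimated variances, which is the same as the logarithm of the ratio of the estimated standard deviations.

```python
import math

def fisher_z(first, second):
    """Return (z, d1, d2), with d1 belonging to the greater estimated variance."""
    def estimate(xs):
        n = len(xs)
        mean = sum(xs) / n
        return sum((x - mean) ** 2 for x in xs) / (n - 1), n - 1

    (v1, d1), (v2, d2) = estimate(first), estimate(second)
    if v1 < v2:                                  # put the greater variance first
        (v1, d1), (v2, d2) = (v2, d2), (v1, d1)
    return 0.5 * math.log(v1 / v2), d1, d2       # fails if the smaller variance is 0

# Hypothetical samples, for illustration only.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
z_value, d1, d2 = fisher_z(first, second)
```

A value of `z_value` exceeding the tabulated point for the given $d_1$ and $d_2$ would count against the hypothesis that both samples estimate the same $\sigma^2$.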



VI. GENERALIZATION OF THE BINOMIAL LAW:
THE “MULTINOMIAL” DISTRIBUTION

In the preceding chapters the development has been founded
on the “binomial law”, involving the two probabilities only, $p$
and $q$ ($= 1 - p$). This has been sufficient for practical application
to many types of actuarial problems; for in such problems, like
those concerning the occurrence of death or survival, the event
under consideration either happens, or does not, i.e., the basic
probabilities required are the probability of an event happening
($p$), and the complementary probability that it will not happen
($1 - p$, or $q$). And whenever it has been necessary to consider a
series of such occurrences (for example, a series of deaths at $\nu$
different ages, or in $\nu$ various localities), the “mathematical
model” constructed has been to regard each term of the series as
independent, so that, in effect, the same basic model was used
for each separate term of the series, and the series was eventually
covered by successive independent applications of that one-term
model.

In the case, for example, of a series of observed deaths,
$\theta'_1, \theta'_2, \ldots, \theta'_\nu$, which have arisen from $E_1, E_2, \ldots, E_\nu$ exposed
to risk, at $\nu$ different ages 1 to $\nu$, it is assumed in this method that
the terms of the series are independent, so that, if $q_1, q_2, \ldots, q_\nu$
are the “true” independent probabilities of death at the several
ages, and $p_1, p_2, \ldots, p_\nu$ the corresponding “true” independent
probabilities of survival, then at any age $r$ (for $r = 1, 2, \ldots, \nu$)
the probability (unless $E_r q_r < 10$), by the Normal Law of Deviations
(10), of getting the observed deaths $\theta'_r$ instead of the “true”
$\theta_r$ is $\frac{1}{\sqrt{2\pi E_r p_r q_r}}\, e^{-\frac{(\theta'_r - \theta_r)^2}{2 E_r p_r q_r}}$, and consequently for all ages
$r = 1, 2, \ldots, \nu$ the probability is the product of all these independent
probabilities, namely,

$$\frac{1}{(2\pi)^{\nu/2} \sqrt{E_1 E_2 \cdots E_\nu \prod p_r q_r}}\; e^{-\frac{1}{2} \sum_{r=1}^{\nu} \left[\frac{(\theta'_r - \theta_r)^2}{E_r p_r q_r}\right]}.$$

The exponent $-\frac{1}{2} \sum_{r=1}^{\nu} \left[\frac{(\theta'_r - \theta_r)^2}{E_r p_r q_r}\right]$ in this probability, which here
arises from separate applications of the binomial Normal Law of
Deviations (10) to each of the $\nu$ independent terms of a series of
observations, should be noted carefully, for it will be encountered
again in the development to be now explained (cf. p. 217).


When, however, the terms of the series cannot thus be regarded
as independent of each other, it becomes necessary to
extend the mathematical model so that the series of, say, $\nu$ terms
can all be dealt with at once. The procedure required is clearly
a generalization of the binomial into a multinomial formula.

Let us therefore first consider a hypothetical “parent” population
only. Suppose, accordingly, that we wish to visualize the
distribution over $\nu$ “cells” of a total hypothetical parent population
of $N$, in the form of $f_1, f_2, \ldots, f_\nu$ falling into the cells
numbered $1, 2, \ldots, \nu$ respectively, where $f_1 + f_2 + \ldots + f_\nu = N$, so
that the total $N$ of the parent population is entirely distributed
(such as hypothetical deaths $D_1, D_2, \ldots, D_\nu$ at successive ages
1 to $\nu$ in a parent population, where $\sum_{r=1}^{\nu} D_r = N$). Then, recalling,
firstly, the argument used in establishing the binomial case, it
is evident that, since we are here contemplating the parent population
only, the “true” probability, say $p_r$, of one of the $N$ items
falling into the $r$th cell is $\frac{f_r}{N}$, and the probability of all the $f_r$
items falling into the $r$th cell is $\left(\frac{f_r}{N}\right)^{f_r} = p_r^{f_r}$. But now, unlike
the binomial model, we do not have to consider the complementary
probability $\left(1 - \frac{f_r}{N}\right)$ that an item will not fall into the
$r$th cell, since that contingency is covered completely by the
assignment of the other $(\nu - 1)$ probabilities of falling into each
of the other $(\nu - 1)$ cells. It is therefore clear that, introducing
the necessary permutation, the probability of getting the parent
distribution $f_1, f_2, \ldots, f_\nu$ in the $\nu$ different cells is

$$\frac{N!}{f_1!\, f_2! \cdots f_\nu!}\; p_1^{f_1} p_2^{f_2} \cdots p_\nu^{f_\nu} \tag{48}$$


With this mathematical model clearly in mind, it will be
apparent that if now we consider an actual case where the $N$
items are observed to be distributed over the $\nu$ cells in the series
$f'_1, f'_2, \ldots, f'_\nu$ instead of in the theoretical series
$f_1, f_2, \ldots, f_\nu$, then the probability of getting that
particular observed series (out of the many such series which are
possible as variations of the “true” $f_1, f_2, \ldots, f_\nu$) is

$$\frac{N!}{f'_1!\,f'_2!\cdots f'_\nu!}\;p_1^{f'_1}\,p_2^{f'_2}\cdots p_\nu^{f'_\nu}$$ ....(49)

where again the “true” probabilities are $p_r=\frac{f_r}{N}$ for
$r=1, 2, \ldots, \nu$, and in this particular case all the $N$ items
are again distributed so that $f'_1+f'_2+\ldots+f'_\nu=N$ in conformity
with $f_1+f_2+\ldots+f_\nu=N$ for the parent population. This is the
general term in the expansion of the multinomial
$(p_1+p_2+\ldots+p_\nu)^N$.
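
As a numerical illustration of the general term (49), the following
Python sketch evaluates the expression directly from its factorials.
The function name `multinomial_probability` and the small parent
distribution of 2, 3, and 5 items are assumptions made for the example
only; neither appears in the text.

```python
from math import factorial, prod


def multinomial_probability(observed, parent):
    """Probability (49) of the observed cell counts, given the parent counts."""
    n = sum(parent)
    if len(observed) != len(parent) or sum(observed) != n:
        raise ValueError("observed and parent series must have equal length and total")
    # "True" probabilities p_r = f_r / N taken from the parent population.
    p = [f / n for f in parent]
    # Permutation factor N! / (f'_1! f'_2! ... f'_v!), kept in exact integers.
    coefficient = factorial(n)
    for count in observed:
        coefficient //= factorial(count)
    return coefficient * prod(pr ** count for pr, count in zip(p, observed))


parent = [2, 3, 5]  # hypothetical parent distribution, N = 10
print(multinomial_probability([2, 3, 5], parent))  # the parent series itself
print(multinomial_probability([1, 4, 5], parent))  # one possible variation
```

Summing the function over every possible observed series with the same
total should give unity, which is a convenient check that the term is
indeed one member of the expansion of the multinomial.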


The equality of the totals which is thus here imposed means that
$\sum_{r=1}^{\nu}f'_r=\sum_{r=1}^{\nu}f_r=\sum_{r=1}^{\nu}(Np_r)=N$.
Under these conditions, as the total $N$ is fixed, it will be seen that
the last $f'$ is determined automatically as soon as the other $\nu-1$
of the $f'$'s are assigned. [In the binomial case, for example, where
$\nu=2$, the number of independent variables is 1, since
$f'_1+f'_2=N$, so that $f'_2$ is fixed when $f'_1$ is known; in the
multinomial case, correspondingly, there are $\nu$ variables of which,
however, only $\nu-1$ are independent, since
$f'_1+f'_2+\ldots+f'_\nu=N$, so that $f'_\nu$, say, is fixed when
$f'_1, f'_2, \ldots, f'_{\nu-1}$ are known]. In the usual terminology,
there is in these cases one constraint (here a linear constraint,
because the condition imposing the constraint, namely, that
$f'_1+f'_2+\ldots+f'_\nu=N$, involves only the first powers of the
$f'$'s); and this one constraint leaves $\nu-1$ degrees of freedom,
being the remaining $\nu-1$ of the $f'$'s which remain free to be
assigned at will. Similarly, if, as will be encountered later, the
conditions of a problem impose $k$ such constraints, then there are
$\nu-k$ degrees of freedom (see p. 175; A; 19).


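If now in the expression (49), we replace all the factorials by
their approximations according to Stirling’s formula as in the
deduction of the Normal Law of Deviations on p. 203; B; 5, and
express each deviation in the standardized form
$\frac{f'_r-Np_r}{\sqrt{Np_r}}$, the expression reduces (see p. 230;
B; 14) to

$$\frac{1}{\left(\sqrt{2\pi N}\right)^{\nu-1}\sqrt{p_1p_2\cdots p_\nu}}\;e^{-\frac{1}{2}\sum_{r=1}^{\nu}\frac{(f'_r-Np_r)^2}{Np_r}}$$ ....(50)

as long as $Np_r$ is not less than about 10.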
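
The closeness of (50) to the exact term (49) can be examined
numerically. The sketch below is an illustration only: the function
names, the probabilities .2, .3, .5, and the observed series 22, 27, 51
are all assumed for the purpose, and were chosen so that every $Np_r$
is at least 10.

```python
from math import exp, factorial, pi, prod, sqrt


def exact_multinomial(observed, p):
    """General term (49) of the multinomial expansion."""
    coefficient = factorial(sum(observed))
    for count in observed:
        coefficient //= factorial(count)
    return coefficient * prod(pr ** count for pr, count in zip(p, observed))


def normal_approximation(observed, p):
    """Approximation (50), obtained from (49) by Stirling's formula."""
    n = sum(observed)
    cells = len(p)
    # Sum of squared deviations, each divided by the expected value N * p_r.
    exponent = sum((f - n * pr) ** 2 / (n * pr) for f, pr in zip(observed, p))
    return exp(-0.5 * exponent) / (sqrt(2 * pi * n) ** (cells - 1) * sqrt(prod(p)))


p = [0.2, 0.3, 0.5]      # assumed "true" probabilities
observed = [22, 27, 51]  # assumed observed series, N = 100
print(exact_multinomial(observed, p), normal_approximation(observed, p))
```

With only two cells the second function reduces to the binomial form
(10), so the same sketch can be used to check that special case.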

This may be referred to as the Multinomial Normal Law
of Deviations. Just as the introduction of the Stirling approximation
into the general term of the point binomial, as given in (2), led to
the symmetrical “normal” approximation (10) with the two probabilities
$p$ and $q$, and maximum at the mean thence decreasing to zero at both
ends, so here the resulting (50) would, in $\nu$-dimensional space, be
maximum in the neighbourhood of the “point” with co-ordinates
$Np_1, Np_2, \ldots, Np_\nu$, and would thence everywhere decrease
symmetrically to zero.

The fact that (50) is a generalization of the binomial form
(10) may be seen readily. For the binomial represents, in reality,
the filling of but two cells, so that $\nu=2$, through the operation
of $p_1=p$ and $p_2=q$; all the $n$ cases are so distributed, so that
$N=n$; and $f'_1=np+x$ and $f'_2=nq-x$. Putting these values in
(50), the result is at once (10), for the sum in the exponent is then

$$\sum_{r=1}^{2}\frac{(f'_r-Np_r)^2}{Np_r}=\frac{(f'_1-Np_1)^2}{Np_1}+\frac{(f'_2-Np_2)^2}{Np_2}=\frac{(np+x-np)^2}{np}+\frac{(nq-x-nq)^2}{nq}=x^2\left(\frac{1}{np}+\frac{1}{nq}\right)=\frac{x^2}{npq}$$

since $p_1+p_2=p+q=1$.

From this analysis it will accordingly be seen that the
approximate probability of getting a particular observed series
$f'_1, f'_2, \ldots, f'_\nu$ is (50), in which the standardized
deviation is
$\frac{f'_r-Np_r}{\sqrt{Np_r}}=\frac{f'_r-f_r}{\sqrt{f_r}}$. If,
therefore, we here write

$$\chi_0^2=\sum_{r=1}^{\nu}\frac{(f'_r-Np_r)^2}{Np_r}=\sum_{r=1}^{\nu}\frac{(f'_r-f_r)^2}{f_r}=\frac{(f'_1-Np_1)^2}{Np_1}+\frac{(f'_2-Np_2)^2}{Np_2}+\ldots+\frac{(f'_\nu-Np_\nu)^2}{Np_\nu}$$

we have

$$\frac{1}{\left(\sqrt{2\pi N}\right)^{\nu-1}\sqrt{p_1p_2\cdots p_\nu}}\;e^{-\frac{1}{2}\chi_0^2}$$ ....(52)

as the probability of getting the observed data
$f'_1, f'_2, \ldots, f'_\nu$, i.e., the probability of the occurrence
of the deviations $(f'_r-Np_r)$ where $r=1, 2, \ldots, \nu$.

This $\chi_0^2$, a function of great importance, as will be seen later
in Chapters VIII and IX, is thus the sum of the squares of the
deviations of the observed from the expected parent values, each
divided by the latter. It is not arbitrary, since it has emerged
as an essential part of the derivation of the Multinomial Normal
Law of Deviations, and is thus entirely consistent with the theory
underlying that formula, of which the binomial is a special case.
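
The computation of this function is elementary, as the following
Python sketch suggests. The function name `chi_square_0`, the
probabilities, and the observed series are assumptions chosen for
illustration and are not taken from the text.

```python
def chi_square_0(observed, expected):
    """Sum of squared deviations, each divided by the expected parent value."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected series must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = 100
p = [0.2, 0.3, 0.5]              # assumed "true" probabilities
expected = [n * pr for pr in p]  # expected parent values N * p_r
observed = [22, 27, 51]          # assumed observed series
print(chi_square_0(observed, expected))
```

An observed series which coincides with the expected parent values
gives zero, and the value grows as the deviations increase.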


Now for the binomial we know from (10) and (11) that the
probability of a deviation $x$ lying between $\alpha\sqrt{n}$ and
$\beta\sqrt{n}$ is

$$\frac{1}{\sqrt{2\pi npq}}\int_{\alpha\sqrt{n}}^{\beta\sqrt{n}}e^{-\frac{x^2}{2npq}}\,dx.$$

Writing $p=p_1$, $q=p_2$, and $n=N$ to correspond with the multinomial
case, and changing the variable by putting $\frac{x}{\sqrt{n}}=t_1$,
this becomes

$$\frac{1}{\sqrt{2\pi}\sqrt{p_1p_2}}\int_{\alpha}^{\beta}e^{-\frac{t_1^2}{2p_1p_2}}\,dt_1.$$

Introducing now a second variable $t_2=-t_1$, the term
$\frac{t_1^2}{p_1p_2}$ in the exponent can be written
$\left(\frac{t_1^2}{p_1}+\frac{t_2^2}{p_2}\right)$, since $t_2^2=t_1^2$
and $p_1+p_2=1$. Hence the probability in the multinomial (50), when
$\nu=2$, that the deviation $(f'_1-Np_1)$, which is $np+x-np=x$ in (10)
and (11), will lie between $\alpha\sqrt{n}$ and $\beta\sqrt{n}$
($f'_2$ being, of course, fixed by the linear constraint
$f'_1+f'_2=N$) is

$$\frac{1}{\sqrt{2\pi}\sqrt{p_1p_2}}\int_{\alpha}^{\beta}e^{-\frac{1}{2}\left(\frac{t_1^2}{p_1}+\frac{t_2^2}{p_2}\right)}\,dt_1, \text{ where } t_2=-t_1$$ ....(53)


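Extending this principle, therefore, for all the $\nu$ variables of
(50) and (52), it will be seen that the probability that the deviations
$(f'_r-Np_r)$ of the series $f'_1, f'_2, \ldots, f'_{\nu-1}$ will all
simultaneously lie between $\alpha_1\sqrt{n}$ and $\beta_1\sqrt{n}$,
between $\alpha_2\sqrt{n}$ and $\beta_2\sqrt{n}$, $\ldots$, and between
$\alpha_{\nu-1}\sqrt{n}$ and $\beta_{\nu-1}\sqrt{n}$ respectively (the
last variable then, of course, being fixed by the linear constraint
that $f'_1+f'_2+\ldots+f'_\nu=N$) is

$$\frac{1}{\left(\sqrt{2\pi}\right)^{\nu-1}\sqrt{p_1p_2\cdots p_\nu}}\int_{\alpha_1}^{\beta_1}\int_{\alpha_2}^{\beta_2}\cdots\int_{\alpha_{\nu-1}}^{\beta_{\nu-1}}e^{-\frac{1}{2}\sum_{r=1}^{\nu}\frac{t_r^2}{p_r}}\,dt_1\,dt_2\cdots dt_{\nu-1}$$ ....(54)

where $t_\nu=-(t_1+t_2+\ldots+t_{\nu-1})$.

This expression is of great importance in the theory establishing
Pearson's $\chi^2$-test of Goodness of Fit (see Chapter IX).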



VII. FREQUENCY DISTRIBUTIONS AND CURVES 
IN GENERAL 


The Point Binomial and the Normal Curve 


It was observed in Chapter III that the Bernoullian, or point
binomial, frequency distribution (3), i.e., $(q+p)^n$, is of the
“discrete” class, and is symmetrical when $q=p=\frac{1}{2}$, but
unsymmetrical when $q\neq p$, under which latter circumstances the
series of ordinates, although markedly skew for small values of
$q$ and $nq$, rapidly approaches the symmetrical form as $nq$ increases
(see especially p. 267; C; 4). It was also shown that the ordinates
of this distribution, when measured from the mean $np$, may be
represented very closely by the always symmetrical Normal Law
of Deviations

$$y_x=\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{x^2}{2npq}}$$ ....(10)

except when the asymmetry of (3) would be very marked by
reason of $q$ (or $p$) being so small and $n$ sufficiently large that
$nq$ (or $np$) remains finite but small, i.e., when $q$ (or $p$) is
very small but a sufficient number of trials, $n$, is made that the
event does happen occasionally. Finally, (10) was expressed in the
continuous form of the Normal Curve of Error

$$y_x=\frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^2}{c^2}}$$ ....(11)

which again is an always symmetrical bell-shaped curve with
relationship to (10), and parameters, expressible as

$$c=\sqrt{2npq}=\sqrt{\pi}\,(\text{mean error, }\eta)=\sqrt{2}\,(\text{standard deviation, }\sigma)=\frac{1}{.476936}\,(\text{probable error, }\lambda)$$ ....(18)

The “Skew-Normal” Curve

Notwithstanding the early belief that this symmetrical
Normal Curve might, indeed, represent a universal law of nature
(see p. 151; A; 3), it was realized in the classical dissertations
that the symmetrical forms (10) and (11) had been reached by a
particular method of approximation, in which certain terms had
been neglected. In the demonstration at p. 204; B; 5, for example,
it is shown that (10) and (11) are, in fact, obtained by
neglecting the last term in the Skew-Normal Curve

$$y_x=\frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^2}{c^2}}\,e^{\frac{x(p-q)}{c^2}}$$ ....(ii)

which, therefore, is in theory more strictly applicable when
$p\neq q$. The effect of the last slightly skew term in this
expression, however, is very small except in extreme cases (see
p. 265; C; 4), so that the Skew-Normal curve is of little practical
importance.


Poisson’s Exponential: The “Law of Small Numbers”

Another formula which has commanded much greater interest,
and is certainly of major importance, was given so long ago
as 1837 (H:;^0:205) by the French mathematician Poisson, and
has become known as the Law of Small Numbers (see p. 166; A;
11). The conditions for its applicability are precisely those
under which the approximations inherent in the Normal Law
are not satisfied, i.e., when $q$ (or $p$) is so small but the number
of trials, $n$, is sufficiently large that $nq$ (or $np$), while small,
is finite, so that the event only happens occasionally. Let us
examine, therefore, the effect upon the fundamental

$$\frac{n!}{(np+x)!\,(nq-x)!}\;p^{np+x}\,q^{nq-x}$$ ....(2)

which represents the probability of $(np+x)$ successes and $(nq-x)$
failures in $n$ trials, of supposing that $q$ is very small but $n$
still large enough that $nq=m$, where $m$ is small but nevertheless
finite. Then (writing $r$ for $nq-x$), it follows, as shown at
p. 230; B; 15, that the probability of $r$ occurrences of such a rare
event in $n$ trials is given by

$$\frac{e^{-m}\,m^r}{r!}$$ ....(55)

where $m=nq$.

This Poisson exponential, of which several tables are available
(see p. 234; B; 15), is a discrete function, existing only for
integral values $r=0, 1, 2, \ldots, n$. The single parameter, $m$, is
very easily determined from

Mean $=m$ ....(56)

It may also be shown (p. 234; B; 15) that

....(57)

The marked skewness of its frequency polygon for small
values of $m$, and the rapidity with which it approaches the
symmetrical “normal” form, may be seen from Figure 4.

Figure 4. Frequency Polygons of the Poisson Exponential

While it is inadvisable to attempt an entirely specific answer
to the question as to the exact statistical conditions under which
the Poisson exponential is to be preferred to the “normal” form,
it may, nevertheless, be indicated that the application of the
Normal Curve should be made with caution for values of $q$ (or $p$)
below about .03 when $nq$ (or $np$), to which $m$ in the diagram
on p. 61 is an approximation, is 10 or less. The admissibility of
the procedure based on the Normal Curve may, of course, also
depend upon the type of statistical enquiry under consideration;
the application of the normal, or the Poisson, theory to a
distribution shaped as in the preceding diagram might be uncertain in
respect of the calculation of some particular ordinate or an area
on one side only of the mean, but yet might well give a close
approximation if an area of the curve were required in order to
examine certain limits of both positive and negative deviations
taken together (see p. 310; C; 14).
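
The caution just expressed may be illustrated by comparing the exact
binomial ordinates with those of the Poisson exponential and of the
Normal Law of Deviations (10). The following Python sketch is offered
only as an illustration; the values $n=200$ and $q=.01$ (so that
$m=nq=2$) and the function names are assumptions, not figures from
the text.

```python
from math import comb, exp, factorial, pi, sqrt


def binomial_ordinate(r, n, q):
    """Exact probability of r occurrences of an event of probability q in n trials."""
    return comb(n, r) * q ** r * (1 - q) ** (n - r)


def poisson_ordinate(r, m):
    """Poisson exponential e^(-m) m^r / r! with m = nq."""
    return exp(-m) * m ** r / factorial(r)


def normal_ordinate(r, n, q):
    """Normal Law of Deviations (10), measured from the mean nq."""
    x = r - n * q
    variance = n * q * (1 - q)
    return exp(-x * x / (2 * variance)) / sqrt(2 * pi * variance)


n, q = 200, 0.01  # assumed: a rare event, with nq = 2
for r in range(7):
    print(r, binomial_ordinate(r, n, q), poisson_ordinate(r, n * q), normal_ordinate(r, n, q))
```

For these assumed values the Poisson ordinates follow the binomial
closely, whereas the symmetrical normal ordinates depart from it
noticeably in the neighbourhood of $r=0$.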


Edgeworth’s Generalized Law of Error 

Another method of taking into account the terms neglected 
in the derivation of the normal forms (10) and (11) was developed 
in England by Edgeworth, and may be noted conveniently at this 
point on account of the manner of its approach, although chrono- 
logically it succeeded the basic investigations of the Scandinavian 
school (see p. 167 ; A; 12). The Generalized Law of Error which 
Edgeworth reached (see p. 234; B; 16) may be written 







kt 

(/+2)! 







c* 


....(58) 


where $k_1, k_2, \ldots$ are constants and $D$ represents
differentiation with regard to $x$.


When all the $k$'s are zero, this expression obviously reduces
at once to the Normal Curve.

If, however, we retain $k_1$, but neglect $k_2$, etc., the result in
the form used by Edgeworth is (see also p. 205; B; 5)


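$$y_x=\frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^2}{c^2}}\left[1-2j\left(\frac{x}{c}-\frac{2}{3}\,\frac{x^3}{c^3}\right)\right], \text{ where } j=\frac{\mu_3}{c^3}$$ ....(59)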


Similarly, retaining terms with ki, kt, and ife*, but neglecting 
the rest, the approximation becomes (see H:16lB:45, and p. 206; 
B; 5) in Edgeworth’s notation 
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1 




+/ 


C's/v 

5 




5)+<(|- 


** . 20 . 8 


+ 10 - - 


3 c«’^ 9 






.(60) 


where 


i= 


M3 j . 1 

— and i = — 
4 



The practical application of these curves is discussed briefly
on p. 311; C; 15. For later comparison with the Gram-Charlier
Type A series it may be useful to note (see P:5;^:133) that by
expanding the exponential Edgeworth's form (60) can be put as

$$\frac{N}{\sigma}\left\{\tau_1(h)+.81649658\,\sqrt{\beta_1}\,\tau_4(h)+.98601330\,\beta_1\,\tau_7(h)+\ldots+.45643546\,(\beta_2-3)\,\tau_5(h)+\ldots\right\}$$ ....(61)

2 

where JV is the total frequency, /5i = ^ » ft = ^ i and Tn+i(A) 

M2 M2 

s= i — ( e 2 jas used in ‘Tables for Statisticians*' 

\/n! <iA«V2\/jr / 

(P:S7:Part II). 


The Gram-Charlier (Type A) and Poisson-Charlier (Type B) Series

Edgeworth's general expression (58), as noted on p. 234; B; 16,
can be established as the distribution of a magnitude depending
on a number of independently varying elements, and as such can
evidently be written (see P;i55:168-172)

$$y_x=\varphi(x)+A_3\frac{d^3\varphi(x)}{dx^3}+A_4\frac{d^4\varphi(x)}{dx^4}+A_5\frac{d^5\varphi(x)}{dx^5}+\ldots$$ ....(62)

where $A_3, A_4, A_5, \ldots$ are constants, and
$\varphi(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}$.


This method of using the “normal” function, $\varphi(x)$, as a
generating function of a series was first applied to the
representation of skew frequency distributions by Gram, of Denmark
(H:^P), and has since been developed by several Scandinavian
mathematicians, notably Charlier (H-JOO), Wicksell, and Jørgensen.
The values of the constants have been determined by several methods
(see p. 235; B; 17), and in terms of moments enable (62) to be
expressed as


yz=<p{x)- 


1 


3! 



Vt (x) + 


Jl_ 

4! 

1 

6! 




....(63) 


where $\varphi_n(x)$ is written for $\frac{d^n\varphi(x)}{dx^n}$.
These derivatives of $\varphi(x)$ are easily obtained, and their
values are available in many standard tables and texts (for example,
in P:^7, P:47, and Fill 4^.209; see also P:5d:214 and 280-1). This
expansion is known as the Gram-Charlier, or Type A, series.

For comparison with Edgeworth’s series, and in computation,
it is convenient to put (63) in the form (see P:S;8:130)

$$\frac{N}{\sigma}\left\{\tau_1(h)+.81649658\,\sqrt{\beta_1}\,\tau_4(h)+.45643546\,(\beta_2-3)\,\tau_5(h)+\ldots\right\}$$ ....(64)

in which the notation follows that already stated for (61).

The practical utility of the Gram-Charlier Type A series, 
which is thus based on the use of the Normal Curve as a gener- 
ating function, is evidently dependent upon rapid convergence 
in order that a few terms only shall be required. 


For cases of marked skewness, Charlier has also employed as
a generating function the Poisson exponential (55), which can
assume a very skew shape, instead of the symmetrical Normal
Curve, in the form

....(65)

where

$$\psi(x)=e^{-m}\,\frac{\sin\pi x}{\pi}\left[\frac{1}{x}-\frac{m}{1!\,(x-1)}+\frac{m^2}{2!\,(x-2)}-\ldots\right]$$

which in the limit when $x$ is an integer becomes
$\frac{e^{-m}\,m^x}{x!}$, and where also

This expression proceeds by differences instead of derivatives
because the Poisson exponential is a discrete function (see also
P:Sff:268-9, and P:J?4^:38). It is usually called the
Poisson-Charlier, or Type B, series.

Four different methods of fitting are suggested by Charlier,
for which the formulae are easily accessible in P:5^:131-2 (see
also P:Sff:271 et seq.). References to practical illustrations are
given on p. 311; C; 16 here.

Pearson’s System of Frequency Curves

Many frequency distributions which are encountered in practice
are either symmetrically or unsymmetrically bell-shaped,
and thus rise from zero to a maximum and thence decrease.
This characteristic has led to the extensive use of a highly
valuable system of frequency curves which was originated and
developed by Karl Pearson. For if a curve is asymptotic to the
$x$-axis (i.e., has “high contact”) at one end we must have
$\frac{dy}{dx}=0$ when $y=0$; if the maximum occurs at, say, $x=-a$,
we must again there have $\frac{dy}{dx}=0$; and accordingly a very
general expression for such a frequency distribution (which may
evidently include other types of curves as well) can be written down
immediately as

$$\frac{dy}{dx}=\frac{y(x+a)}{F(x)}, \text{ or } \frac{d\log y}{dx}=\frac{x+a}{F(x)}$$ ....(66)

As an alternative method of approach it may be recalled that
the Normal, Skew-Normal, and Poisson expressions were obtained
as developments of the “point binomial”, in which the
probability of success, $p$, remains constant. If, however, this
probability is not constant, but depends on the previous occurrences
in a set of trials, we must use a different probability series,
namely, the hypergeometrical, based on the supposition of drawing,
say, $r$ balls one at a time, without replacements, from a bag
containing $np$ white and $nq$ black balls. The probability that $s$
balls will be white out of the $r$ balls so drawn is

$$\frac{{}^{np}C_s\;{}^{nq}C_{r-s}}{{}^{n}C_r}$$

Putting then $s=0, 1, 2, \ldots, r$ we obtain the ordinates of the
distribution at unit intervals, the series taking (see P:5;^:39-41,
and P:jff^:50-3) the general form

$$\frac{1}{y}\frac{dy}{dx}=\frac{x+a}{b_0+b_1x+b_2x^2}$$ ....(67)

This expression is that already written down at (66) when $F(x)$ is
expanded in ascending powers of $x$.
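
The ordinates of the hypergeometrical series are readily computed. In
the sketch below, which is illustrative only, the function name and
the bag of 3 white and 2 black balls are assumptions made for the
example.

```python
from math import comb


def hypergeometric_ordinates(white, black, r):
    """Probabilities that s = 0, 1, ..., r of the r balls drawn are white."""
    if r > white + black:
        raise ValueError("cannot draw more balls than the bag contains")
    total = comb(white + black, r)
    # comb() returns 0 when more balls are asked for than are available.
    return [comb(white, s) * comb(black, r - s) / total for s in range(r + 1)]


# Assumed example: np = 3 white and nq = 2 black balls, r = 2 drawn.
print(hypergeometric_ordinates(3, 2, 2))
```

Because the drawings are made without replacement, the probability of
a white ball changes from one drawing to the next, which is the
feature distinguishing this series from the point binomial.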


The various curves which arise from the integration of (67)
evidently depend upon the forms taken by the denominator, i.e.,
they depend upon the nature of the roots of $b_0+b_1x+b_2x^2=0$, for
which the criterion is, of course, $\frac{b_1^2}{4b_0b_2}$. Too much
space would be required to give the integrations here. They are,
however, quite straightforward, and are available so clearly in
Elderton's book (P:5;^:38 et seq.) that actuarial students may be
referred to it without hesitation.

In Pearson’s numbering there are 13 curves, called Types I 
to XII, and the Normal curve. Types I, IV, and VI are the 
Main Types; the others are Transition Types which arise as 
limiting cases when the main types change into each other, and 
embrace not only the Normal Curve but also a straight line, a 
geometrical progression, and J-shaped, twisted J-shaped, and 
U-shaped curves. 

The distinctions between the three main types and the ten 
transition types are indicated in the following table, where, for 
ease of reference, Pearson’s own numbering is used and the 
sequence adopted by Elderton in 'P-M is retained. 


Typical shapes of the various curves are next shown for cer- 
tain positive values of the parameters. 
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Number 
of Type 



Shape; and whether Limited 
(in both direction!). Limited 
One Way (i.e., in one direc- 
tion only), or Unlimited (in 
both directions) 



Main Types 


y =y<i{x —a)^'x~^' 


Transition Types 


Usually skew bell-shaped. J- 
shaped when pat negative. 
Twisted J-shaped when both 
pai and pat are arithmetically 
< 1 and one of them nega- 
tive. U-shaped when both 
pax and pat negative. Limited. 

Skew bell-shaped. Unlimited. 


Usually skew bell-shaped. 
J-shaped when q% negative. 
Limited one way. 


Normal 

Curve 



y=^yQe 


'=^«(i+S) 


r=yo(l+*y 




-n(l+f) 


y=yoe 


y=yoX- 


/ ai+Jc V 

'\ai—x) 


Symmetrical bell-shaped. 
Unlimited. 

Usually symmetrical bell- 
shaped. U-shaped when m 
negative. Limited. 


Symmetrical bell-shaped. 
Unlimited. 


Usually skew bell-shaped. 
Geometrical progression (ex- 
ponential) when -ya — 0. J- 
shaped when ya <0. Limited 


Skew bell-shaped. Limited 
one way. 


From infinite ordinate at 
xtm—a to finite ordinate at 
X —0: m lies between 0 and 1. 
Equilateral hyperbola when 
m -1. 

From zero ordinate at z — — o 
to finite ordinate at z "iO 
when m>0; straight line 
when m «1. 

Exponential (special case of 
Type 111); Laplace's "First 
Law of Error'^ (see p. 159; 
A; 4). 

J-shaped (special case of 
Type VI). 

Twisted J-shaped (special
case of Type I).

Figure 16. Transition Type XI

Since the form of the curve depends upon $\frac{b_1^2}{4b_0b_2}$, the
selection of the appropriate type in any particular case can be made
readily by computing the numerical value of this criterion. In
terms of moments it may be written easily (see P:S;S:41"45) as

$$\kappa=\frac{\beta_1(\beta_2+3)^2}{4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)}, \text{ where } \beta_1=\frac{\mu_3^2}{\mu_2^3} \text{ and } \beta_2=\frac{\mu_4}{\mu_2^2}$$ ....(68)

from which the curve can be selected by the table in F:SS:5l.
Alternatively, the type can be chosen by means of a diagram, to
be found in P:S7, showing the regions of each type for values of
$\beta_1$ and $\beta_2$. Another rapid method is to calculate from the
data the values of A* log y, whence the type may be indicated as
suggested in P:ffl :60.
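
The arithmetic of the criterion can be sketched in Python as follows.
The function names and the moments used are assumptions for
illustration. The rule in `main_type` (Type I for a negative
criterion, Type IV between 0 and 1, Type VI above 1) is the commonly
quoted rule for the three main types and is stated here as an
assumption; the table referred to above should be consulted for the
transition types.

```python
def pearson_criterion(mu2, mu3, mu4):
    """Return (beta1, beta2, kappa) computed from the central moments."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    denominator = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    if denominator == 0:
        raise ValueError("criterion is infinite: a transition type is indicated")
    kappa = beta1 * (beta2 + 3) ** 2 / denominator
    return beta1, beta2, kappa


def main_type(kappa):
    """Main type suggested by the criterion (assumed rule, main types only)."""
    if kappa < 0:
        return "I"
    if 0 < kappa < 1:
        return "IV"
    if kappa > 1:
        return "VI"
    return "transition"


# Assumed moments about the mean, chosen only to exercise the formula.
beta1, beta2, kappa = pearson_criterion(1.0, 0.5, 3.5)
print(beta1, beta2, kappa, main_type(kappa))
```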


The fitting of these curves to statistical data is accomplished
by the “method of moments”, which was developed by Karl
Pearson for that special purpose (see p. 97 here).

Their practical applicability in actuarial work has been
explored extensively, and is discussed at p. 312; C; 17.
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Further Modifications of the Normal Curve/and of Pearson’s 

System 

Although they are not of immediate practical importance, it 
may be well to note here — ^as a matter of theoretical interest 
only — that several other methods of representing frequency dis- 
tributions have also been investigated. 


(i) It may sometimes happen in dealing with certain statistics
that the frequencies for a variable $x$ do not accord with those
of the Normal Curve (11), but that the corresponding frequencies
for some function of $x$, say $f(x)=z$, may be so distributed. Under
those circumstances it follows, as shown at p. 236; B; 18, that the
transformed frequency function is

$$\varphi(x)\,dx=\frac{1}{c\sqrt{\pi}}\,f'(x)\,e^{-\frac{[f(x)]^2}{c^2}}\,dx$$ ....(69)

This device for dealing with frequency distributions was used
extensively by Edgeworth, who called it the Method of Translation
(see H:f5;?:65), and has been explored also by Kapteyn (H:Pf),
Wicksell, and Rietz (H:JfS5, 161, and Jf7^). One important case
which has been widely discussed (for bibliography see F: 175:73)


results from the transformation z = -4- log ( - — ^ J , and leads 

V 2 \b/ 

(see p. 236; B; 18) to the logarithmic frequency function 




c-\/2ir(x—a) 




..(70) 


The properties and practical application of this distribution are
examined in F:176.


(ii) The logarithmic frequency function (70), or an equivalent
form, has also been employed as the generating function
of a series by the Scandinavians Charlier, Jørgensen, and
Wicksell.

(iii) Romanovsky (H*.150) has explored the possibility of
using still other generating functions, such as Pearson’s Types
I, II, and III (cf. P:116:75 and P:lU-n6).

(iv) Carver has suggested the use of a difference
equation, instead of Pearson’s differential equation (67), in the
form

$$\frac{\Delta y}{\Delta x}=\frac{y(a-x)}{b_0+b_1x+b_2x^2}$$ ....(71)

The resulting formulae, with examples of the graduation of
frequency distributions and stumps of such distributions, are given
in 11:129 and P:114:in.

(v) Since the “mode” (the position of the maximum ordinate)
is often of primary importance in economic data, Mouzon (in
li:167) has given the curves resulting from assuming that the
value of the constant $a$ in (66) is first determined from the
observed data and equated to the value of the mode in the
theoretical distribution, and that the polynomial in the denominator
is of the third degree or lower (instead of being taken of
the second degree as in the fundamental equation (67) of Pearson’s
system).

(vi) Hansmann (H:^^;^) also has examined the six main and
fourteen transition type curves which emerge from expanding
$F(x)$ in (66) to the fourth degree, with zero coefficients for $x$ and
$x^3$, so that $F(x)$ is taken as $b_0+b_2x^2+b_4x^4$. His conclusion
is that these fourth order symmetrical curves fitted by moments give
improved results, and so justify the additional work entailed.
It must be remembered, however, that moments up to $\mu_8$ are
involved, with large sampling errors, and that in practical
curve-fitting work it is usually preferable to avoid the use of such
high moments (cf. P:i 4^:754). Heron previously had taken $F(x)$ as
far as $b_0+b_1x+b_2x^2+b_3x^3$, but had not published his
investigations because the additional $b_3x^3$ term did not seem to
effect any practical improvement over Pearson’s curves.

(vii) Pearson’s curves are fitted by the use of moments up to
$\mu_4$ (see Chapter VIII). R. A. Fisher (P:57:355) has observed that
the system of curves for which the method of moments thus
applied is the “best” method of fitting would be given by an
exponential with a fourth degree argument, namely,

$$y=e^{-a^4(a_0+a_1x+a_2x^2+a_3x^3+x^4)}$$ ....(72)

“the convergence of the probability integral requiring that the
coefficient of $x^4$ should be negative, and the five quantities $a$,
$a_0$, $a_1$, $a_2$, $a_3$ being connected by a single relation,
representing the fact that the total probability is unity” (see
p. 237; B; 19). Putting $a^4=1$ (since $a$ depends only on the unit of
measure of $x$), and replacing $x$ by $x-\frac{a_3}{4}$, this
exponential takes the form

y = e- = Ce~ where C = e"'* .... (73) 

Since Pearson’s curves cannot produce a double hump, and it is 
doubtful whether Edgeworth’s or the Gram-Charlier Type A can 
do so conveniently (cf. P:S;^:139 and 140), it is interesting to note 
that the classes of curves which arise from (73) are typically bi- 
modal. They have been investigated by O’Toole, who shows 
an example (li:180:2S) of a double-humped distribution which 
can be so represented (see also P:^5:115). 


Other Curves of Interest to the Actuary 

From the preceding it will have been realized that not even 
the very varied capacities of the Normal, Skew-Normal, and 
Poisson curves, or of the Edgeworth, Gram-Charlier, Poisson- 
Charlier, and Pearson and other generalizations, will always 
handle some of the frequency distributions or series of statistical 
ratios with which the actuary is specially concerned. A great 
deal of investigation and ingenuity, in fact, has been devoted for 
many years to the invention of other curves which might be 
suitable for actuarial statistics. This chapter would therefore 
be seriously incomplete unless some attempt were made to cata- 
logue the various proposals which have been advanced — although 
space will not permit in many cases more than a statement of 
each formula, with references to sources where additional details 
may be found. 
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(1) Frequency Distributions 

For the representation of frequency distributions proper, such 
as the “exposed to risk” or deaths, Hardy has pointed out (P:5i : 
50) that, when m and n are numerically unequal, a skew curve 
vanishing when jc = —a or —6 is given by 


_ (Jl^ -i. 

y=.ke ( 74 ) 

while he has also suggested for the same purpose 

(75) 


where x represents a proportionate part of the range of the curve 
so that X varies between 0 and 1 (see P:51 :135). 


An interesting and much earlier attempt to find a “law” 
(cf. p. 168; A; 13) followed by the “entrants” into mortality 
experiences led Chandler (H :45) to the empirical expression 


—ac* sin x(p 
y —ab c 


....(76) 


for the representation of a skew frequency distribution. 


(2) The Curves of $l_x$ and $\log l_x$

Much attention has been given, notably some years ago (see
p. 168; A; 13), to the possibility of discovering a formula which
would describe adequately the twisted descending curve (as
indicated in Figure 18) of $l_x$, the “number living” in the
hypothetical “life-table”, which results from multiplying an
arbitrarily selected number of births, $l_0$, by the successive values
of the probability of survival $p_x\,(=1-q_x)$. Since these
explorations are of value now only to the extent that they have led in
certain instances to workable expressions for the force of mortality
$\mu_x=-\frac{1}{l_x}\frac{dl_x}{dx}$, it must suffice here to refer
to Elston's paper for the record of most of the various attempts. It
should be noted, nevertheless, that some of the curves which have
attained wide recognition, especially Makeham’s first and second
formulae, (83) and (84), have often been fitted to the data in
their $l_x$ form, and that Hardy (P:W:88) gives an illustration of
the application of those formulae to enumerated census populations.

Another curve used with success by Hardy (loc. cit., 89 and 
68) is the representation of log /*, or the logarithms of the 
numbers living above age by 

y = (77) 

and the form y —A +jEZix:+jBc*+C^c2 (78) 

was employed for the populations and deaths at and above age x 
in the life tables constructed for the British National Insurance 
Act, 1911 (seeH:fj?f:564). 

For the years of infancy and childhood up to age 12, Hardy 
has also used 

/. +Hx-\-Bc^+ .... (79) 

nx+1 

where the last term gives effect to the heavy mortality of the 
early ages and thereafter becomes practically negligible (see H : 
and 391). 

This problem of evolving a formula which would be able to 
take into account the rapid change in the values during the 
infantile ages has more recently been examined again by Stef- 
fensen (H:f^5), who suggests that the ^^uniform seniority** pro- 
perty (see p. 319; C; 18) of the Makeham function (83) can be 
preserved by adding at the infantile ages the expression 

oV *+& 

log/, = 10 ....(80) 

It has also been pointed out (H:18S) that even more satisfactory 
results may be obtainable if (80) is modified into 

cV x+b-^kx 

log/, = 10 +d ....(81) 

(3) The Curves of and colog pg 

Of all the formulae which have been proposed, however, those
of the greatest practical importance to actuaries are of course
the expressions which endeavour to represent the rate of mortality
by age $x$, namely,
$q_x\left(=\frac{d_x}{l_x}=\frac{l_x-l_{x+1}}{l_x}\right)$, or, more
easily, the force of mortality $\mu_x$, or the central death rate
$m_x$, or $\operatorname{colog}_{10}p_x$ or $\log_{10}\mu_x$. The
numerical values of these ratios, naturally, are not like bell-shaped
frequency distributions; from infancy to old age they follow a
contorted U-shaped curve, with the minimum in the neighbourhood of
age 11, so that from about age 11 to the end of life the curve
increases steadily (often with two minor undulations in the
thirties and seventies) with its convexity towards the $x$-axis (cf.
P:10^:41t 45, and 55, and H:/4^:88). If, then, the values up to
about age 10 are dealt with separately, the remainder of the curve
from the region of age 11 upwards will usually be found to change
slowly at first, and at the older ages to resemble more nearly a
geometrical progression, a circumstance which means that often
the logarithms of the values there approximate to an arithmetical
progression. The points of inflexion, however, introduce great
difficulties into the problem of finding any expression which will
represent mortality rates over the whole period of life. A typical
curve for $q_x$ is shown in Figure 19 on the next page.


For a detailed catalogue of the many early attempts to find
a “law” (cf. p. 168; A; 13) of mortality in mathematical form
the reader may again be referred to Elston's paper, as it will suffice
to include here only those which have proved to be practically
useful. The first which attained wide recognition was Gompertz's
simple geometrical progression (¥1:18)

$$\mu_x=Bc^x, \text{ from which } l_x=kg^{c^x}$$ ....(82)

Next followed Makeham's First Modification (H:Sf:303)
where the first differences of $\mu_x$ follow a geometrical
progression,

$$\mu_x=A+Bc^x, \text{ whence } l_x=ks^xg^{c^x}, \text{ and } m_x \text{ or } \operatorname{colog}p_x=\alpha+\beta c^x$$ ....(83)

which possesses the valuable property of “uniform seniority” for
the actuarial computation of joint-life annuities (see p. 319;
C; 18).

Later (H:^5:191) Makeham's Second Modification was suggested,
with the second differences of $\mu_x$ following a geometrical
progression as

$$\mu_x=A+Hx+Bc^x, \text{ whence } l_x=ks^xw^{x^2}g^{c^x}, \text{ and } m_x \text{ or } \operatorname{colog}p_x=\alpha+\gamma x+\beta c^x$$ ....(84)

which leads to a less convenient method of “uniform seniority”.
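
The connection between the force of mortality in (82) and (83) and the
corresponding survival values may be illustrated by a short Python
sketch. The constants $A=.0007$, $B=.00005$, and $c=1.10$ are assumed
for illustration only and do not represent any published table; the
function names are likewise hypothetical. Natural logarithms are used
throughout, so that `makeham_colog_p` gives $-\log_e p_x$, which is of
the form $\alpha+\beta c^x$ with $\alpha=A$ and
$\beta=B(c-1)/\log_e c$.

```python
from math import exp, log


def gompertz_mu(x, B, c):
    """Gompertz force of mortality: a simple geometrical progression."""
    return B * c ** x


def makeham_mu(x, A, B, c):
    """Makeham's First Modification: a constant added to the Gompertz term."""
    return A + B * c ** x


def makeham_survival(x, A, B, c):
    """l_x / l_0, obtained by integrating the force of mortality from 0 to x."""
    return exp(-(A * x + B * (c ** x - 1) / log(c)))


def makeham_colog_p(x, A, B, c):
    """-log_e p_x, the integral of the force of mortality from x to x + 1."""
    return A + B * c ** x * (c - 1) / log(c)


A, B, c = 0.0007, 0.00005, 1.10  # assumed illustrative constants
for age in (20, 40, 60, 80):
    print(age, makeham_mu(age, A, B, c), makeham_survival(age, A, B, c))
```

The first differences of `makeham_mu` at successive ages stand in the
constant ratio $c$, which is the property described in the text.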

These formulae are, in fact, particular cases of an interesting
generalized expression given by Quiquet (W:72, and H:^^),
namely,

$$\log l_x=A+Bx+\sum_i e^{r_ix}f_i(x)$$ ....(85)

where $f_i(x)$ is a polynomial, and the constants $r_i$ are the roots
of $A_0+A_1r+\ldots+A_nr^n=0$. Several other formulae which have
been suggested by various authors may also be derived from (85)
when $n=0, 1, 2,$ or 3, as is indicated in H:f^7:85 (cf. also P:
102 -M).


From Gompertz’s formula (82) we have

$\log_e \mu_x = \log_e B + x \log_e c = a + bx$, where $a = \log_e B$ and $b = \log_e c$,

so that the formula can be put into the exponential form
$\mu_x = e^{a+bx}$.
With a view to introducing greater elasticity, H. L. Trachtenberg
(P:i44) extended this into three improved expressions


or /X 


....( 86 ) 

....(87) 

....( 88 ) 


and pointed out that the last form provides for the two points of
inflexion which are often found in the curve of $\log \mu_x$, since (88)
arises from $\frac{d^2}{dx^2} \log \mu_x = a(x-b)(x-c)$, where the points of
inflexion are at $b$ and $c$ (see also P:f0;^:45 and 66).

Another method of generalizing Makeham’s formula has been 
examined and illustrated by J. Buchanan in the form 

$\operatorname{colog} p_x = K + L r^x \cos (x\theta - \theta_0)$ ....(89)

and by G. J. Lidstone (P:S5:419) as 

$\mu_x = (A + Bc^x) + Mxc^x$ ....(90)

and $\mu_x = A + (B \cos x\phi + M \sin x\phi)c^x$ ....(91)

Noting that many attempts to employ Makeham’s first
expression (83) gave values too high at the older ages, and
emphasizing the importance of being able to take account of the
points of inflexion, W. Perks (P:102) has investigated the
expressions

$\frac{q_x}{1 - q_x} = A + Bc^x$ ....(92)

$q_x$ or $\mu_x = \frac{A + Bc^x}{1 + Dc^x}$ ....(93)

and $q_x$ or $m_x$ or $\mu_x = \frac{A + Bc^x}{Kc^{-x} + 1 + Dc^x}$ ....(94)

It was pointed out by G. F. Hardy (P:5f:68) that formula
(77), when used for $\log l_x$ so that it means that

$\mu_x = Bc^x + Mn^x$ ....(95)

preserves a modified “uniform seniority” principle. With the
addition of a constant, so that

$\mu_x = A + Bc^x + Mn^x$ ....(96)

(which is sometimes referred to as the “double geometric law”),
the modified seniority method is applicable as
shown in P;55:413.

This “double geometric” curve is the Makeham expression
(83) with another geometric term added. The incorporation of
still another such term leads to a “triple geometric” expression

$\mu_x = (A + Bc^x) + (Mn^x + Rr^x)$ ....(97)

which has been discussed and illustrated in 57:539 and F:94.


Another formula which has attained some prominence in 
practical work as a means of representing mortality over the 
whole range of life is Wittstein's expression (H;55:164) 


m 


....(98) 


for which the reader may be referred also to and P: 

59. 


(4) The Curves of $e_x$ and $\mathring{e}_x$

The mathematical representation of the curtate or complete
“expectation of life” ($e_x$ or $\mathring{e}_x$) has received some attention:
firstly, because the “expectation of life” has captured unjustifiably
(see P:164-^01, and P:171 :281) the minds of many laymen,
and secondly, because it may be claimed that in the calculation
of that function a certain amount of graduation has been implicitly
done (11:110:93). Being a gradually decreasing curve, it
can often be reproduced closely (as suggested by G. F. Hardy,
P .61 .79) by

$e_x = a + bx + cx^2 + dx^3 + fx^4$ ....(99)

Several other forms which have been used are noted in P:f ^5:153. 


Steffensen (11:106 and ¥1:127) has experimented also with its
reciprocal (to produce an increasing series; see also P:102-A0) in
the Makeham form

$\frac{1}{e_x} = A + Bc^x$ ....(100)

(5) The Curves of $l_x/l'_x$, $T_x/T'_x$, $\log (T_x/T'_x)$, $\log (p_x/p'_x)$, etc., with Reference
to a Standard Table

Instead of attempting the direct representation of functions
such as $l_x$, or $T_x$ (the population at age $x$ and beyond), a useful
device sometimes is to deal with the simplified curves which
depict the ratios of such functions to their corresponding values,
$l'_x$ or $T'_x$, in a standard table. Thus it was found by H. G. W.
Meikle (JF:88 and 1^:168) that the census data for certain sections
of India in 1921 could be represented by fitting 3rd degree
parabolas to values of $\log (T_x/T'_x)$; the use of $l_x/l'_x$ and $T_x/T'_x$ is noted in
P:f ^7:101; and the advantage of $\log (p_x/p'_x)$ as a function of small
values which progress slowly is indicated in F:81 :213.

(6) The Curve of Sickness Rates 

An early examination of the possibility of representing rates
of sickness, $s_x$, by an analytical formula led Makeham (H :^7) to
suggest again the use of the function $A + Bc^x$, which Hardy
(P:5I:90) accordingly has noted might be applied in the form

$\log (N - s_x) = A + Bc^x$ ....(101)

where N is 52 or a value determined by trial. Discussions of the 
possibility of graduating sickness rates by Makeham’s formula 
are also to be found in H:5S and H:177. 

(7) Curves for the Retirement and Depreciation of Physical 
Property 

Though of course outside the realm of this study, it is inter- 
esting to note that a number of the methods outlined in the 
preceding sections of this chapter for representing mathemati- 
cally the mortality of human beings have been applied (see H :176 
and li:184) also to the estimation of the rates of depreciation and 
retirement of physical property (such as telephone or telegraph 
poles, cables, coils, etc., railroad culverts, cross-ties, cars and 
locomotives, automobiles, electric power equipment, etc.). 

(8) Curves of Population Growth 

Another problem of some interest to the actuary, but often
of more concern to the vital statistician and economist, is that
of computing “intercensal” or “mean” populations between
successive census enumerations, and of estimating (i.e.,
“predicting”) future populations for groups, localities, countries, etc.
(The questions which arise in preserving consistency between
such calculations for the constituent parts of a total and for the
total itself are incidental to the main problem, and need not be
discussed here; they are dealt with in P:jf^i:219, P:i57:63-67,
Y*\171 :283-289, and V:105 :732.) As an alternative to the
assumptions of arithmetical or geometrical progression, or of a parabolic
trend (with an assigned order of differences constant), which are
useful particularly for the estimation of values lying between the
known values of successive censuses (for which see the references
just given), much discussion has surrounded a curve-fitting
method based on the Verhulst-Pearl-Reed (otherwise known as
the Logistic) curve of population growth (see p. 169; A; 14).
According to this theory the “law” of growth over time, $t$, of a
self-contained population, $P_t$, undisturbed by migration should
(see p. 238; B; 20) be capable of representation as




....( 102 ) 


where r denotes the abscissa of the point of inflexion of the curve 
(which follows the symmetrical shape of Figure A6 at p. 197 ; B ; 3) , 
A and B are the ordinates of the asymptotes, and is a constant. 
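The behaviour described for the symmetrical curve can be illustrated with a minimal sketch. The parameterization used here, $P_t = A + (B - A)/(1 + e^{-k(t - \tau)})$, is an assumption chosen only because it has asymptotes at `A` and `B` and a point of inflexion at `tau`; it is not claimed to be the exact form of (102), and the names `tau` and `k` and the numerical values are hypothetical.

```python
import math

def logistic(t, A, B, tau, k):
    """Symmetrical logistic curve (assumed form, for illustration only).

    A and B are the ordinates of the lower and upper asymptotes,
    tau is the abscissa of the point of inflexion, k is a rate constant.
    """
    return A + (B - A) / (1.0 + math.exp(-k * (t - tau)))

# Hypothetical values, not fitted to any census.
A, B, tau, k = 1.0, 5.0, 10.0, 0.3

at_inflexion = logistic(tau, A, B, tau, k)       # midway between the asymptotes
below = logistic(tau - 4.0, A, B, tau, k)
above = logistic(tau + 4.0, A, B, tau, k)
far_past = logistic(tau - 200.0, A, B, tau, k)   # approaches A
far_future = logistic(tau + 200.0, A, B, tau, k) # approaches B
```

Equal distances on either side of `tau` give ordinates equally far from the midpoint, which is the symmetry referred to in the text.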


Pearl and Reed have also suggested (see P:^?^:575 and V \105) 
that the essential symmetry of formula (102) can be modified, 
in order particularly to make allowance for cyclical growth, by 
using the generalized form 


P<=d 




....(103) 


The “logistic” curve within recent years has been fitted, by
several different methods, to a wide variety of populations (see
p. 320; C; 19). It is undoubtedly capable of representing many
series of known past populations with reasonable accuracy,
particularly in the earlier stages of a population’s growth, when
the curve and the data often follow closely a geometrical
progression (cf. Y\17G'!1 and 46). The influences which determine
the progression of a population, however, in reality are numerous,
varied, and unstable; they are actually much more complex and
unpredictable than are the simple principles on which the
symmetrical logistic curve is based. The formula accordingly cannot
have much claim to be accepted as a “law” of growth, and it is
now generally conceded that its application to the problem of
long-range prediction is surrounded inevitably by all the
uncertainties of extrapolation (cf. P:^5:110).

(9) Curves for Forecasting Mortality 

A natural development of the preceding methods of extra- 
polating population data is their application to the forecasting 
of mortality rates. Since changes over time in the mortality 
rates of measurable populations are usually gradual, it has
generally been found sufficient to exhibit the relationships between
$q_x^{z}$ and $q_x^{a}$ (where $z$ and $a$ denote calendar years) on some simple
basis like a parabola, a geometrical progression, a Makeham
expression, or a “logistic” curve. Such applications therefore
need not be detailed here, as they do not involve any new curve- 
fitting processes. The history of their evolution, however, is 
indicated at p. 169; A; 15, with references to the literature. 



VIII. THE FITTING OF CURVES, AND GRADUATION 

In order to appreciate the importance of the methods of 
mathematical statistics in connection with the problem of 
“fitting” curves and “graduating” data, it will be of assistance
to remember that the latter questions are related essentially to 
that of “random sampling”. 

For suppose that we have a set of data $f'_1, f'_2, \ldots, f'_\nu$, such as
the deaths $\theta'_r$ ($r = 1$ to $\nu$), which have been obtained
by observation of a supposedly homogeneous group of men “exposed
to risk”. Even under the assumed conditions of homogeneity
this series, having been secured by observation of a
group necessarily limited in number, will of course contain
irregularities attributable to that limitation. The observed series
$f'_1, f'_2, \ldots, f'_\nu$, if it is to be viewed as unbiassed, must consequently
have been brought into existence by some method of random
selection operating upon a hypothetical “parent population”.
Any attempt either to “fit” a mathematical curve to such a
series, or by some other process to remove the irregularities due
to the paucity of the data and the method of selection, is
therefore in effect an attempt to determine the hypothetical “parent
population” of which the observed $f'_1, \ldots, f'_\nu$ can be considered
to be a random sample.

Now the ideal accomplishment would obviously be the
determination of the “true” parent series, which may be called
$f_1, f_2, \ldots, f_\nu$; and this determination must indeed be the real
objective, in the sense here discussed (cf. p. 239; B; 21). If,
however, a curve is “fitted” to the given series $f'_1, \ldots, f'_\nu$ in any
such attempt to find the true “parent” series $f_1, \ldots, f_\nu$, it is
obvious that the best we can do is to determine, on some criterion,
a series of “fitted” values specified by the calculated parameters
$\alpha, \beta, \gamma, \ldots$ of a fitted curve $y_x = f''(x; \alpha, \beta, \gamma, \ldots)$ of appropriate
form, by which the observed series will be represented, and from
which the characteristics of the parent series may be estimated.


With such a curve of appropriate form the problem is
consequently one of (a) determining the unknown parameters
$\alpha, \beta, \gamma, \ldots$; (b) thence computing the “graduated” values; and
(c) applying tests of “goodness of fit” to the whole curve, in order
to decide, in conformity with the hypotheses, whether the results
should be accepted.

It is thus important to realize that there are three concepts
in the problem: the hypothetical “parent” series $f_1, \ldots, f_\nu$ which
is unknown, the actual series $f'_1, \ldots, f'_\nu$ which has been
“observed”, and the “fitted” or “graduated” series $f''_1, \ldots, f''_\nu$,
say, which is to be determined. In the ideal state (unlimited
material, complete homogeneity, absolute randomness, and
perfect fit) the three would be the same. Under the limitations
of practice, however, it is necessary to formulate some process
of fitting the graduated series $f''_1, \ldots, f''_\nu$, through determination
of its parameters $\alpha, \beta, \gamma, \ldots$ in $y_x = f''(x; \alpha, \beta, \gamma, \ldots)$, so that
certain conditions arising from the theoretical requirements will
be satisfied. The methods of establishing such processes of
fitting will now be discussed.

The Principle of Maximum Likelihood 

As explained on p. 39 (Chapter V), and at p. 239; B; 21, the
specification of the “parent” series $f_1, \ldots, f_\nu$ from the observed
data $f'_1, \ldots, f'_\nu$ can evidently be based logically upon the principle
of maximum likelihood, by which the greatest possible value
is assigned to the probability of the observed data having been
drawn by chance (cf. p. 239; B; 21). When we come to the
problem of determining a series of “fitted” values, $f''_1, \ldots, f''_\nu$,
which may be considered to be the best possible representation
of a series of observed data $f'_1, \ldots, f'_\nu$, it must therefore be
remembered still that the characteristics of the parent series
$f_1, \ldots, f_\nu$ are usually unknown; it is consequently necessary to
assume that we are fitting a curve which is, in fact, of appropriate
form, so that the fitted values ($f''_r$) will be estimates of the true
parent values ($f_r$). Under these conditions, accordingly, we can
view the deviations, $f''_r - f'_r$, between the fitted and observed series
as being due to chance fluctuations arising from the limited size
of the sample of $f'_r$'s which has been drawn from the parent $f_r$'s.

The Method of Least Squares

Since the deviations $f''_r - f'_r$ can thus be considered to have
arisen from chance, they may be assumed to follow the Normal
Curve (11), in which the parameter $c$ will depend on the parent
population $f_r$'s from which the sample is drawn, and is
accordingly independent of the observed $f'_r$'s and the fitted $f''_r$'s. It
will therefore be clear that $c$ will be a constant, either known or
to be estimated, for any given process of selection by which the
observed $f'_r$'s are derived from the parent $f_r$'s.


If now we assume, firstly, that $c$ is the same for each of the
observed $f'_r$'s, i.e., that every one of the $f'_r$'s has been determined
by an equally good process of selection, then the probability of
any particular deviation $f''_r - f'_r$ occurring, by (11), will be
$\frac{1}{c\sqrt{\pi}} e^{-(f''_r - f'_r)^2/c^2}$, and the probability of the whole series of
deviations $(f''_r - f'_r)$ occurring together will be

$\frac{1}{c^{\nu} \pi^{\nu/2}} e^{-\frac{1}{c^2} \sum_{r=1}^{\nu} (f''_r - f'_r)^2}$ ....(104)


In the more general case, when $c$ is not the same for each of
the observed $f'_r$'s, i.e., if the $f'_r$'s have not been equally well
determined, suppose that the methods of selection have been such
that $c_r$ is the value appropriate to $f'_r$; then the probability of any
particular deviation $f''_r - f'_r$, by (11), becomes $\frac{1}{c_r\sqrt{\pi}} e^{-(f''_r - f'_r)^2/c_r^2}$,
and the probability of the whole series of deviations is

$\frac{1}{c_1 c_2 \cdots c_\nu \pi^{\nu/2}} e^{-\sum_{r=1}^{\nu} (f''_r - f'_r)^2 / c_r^2}$ ....(105)

Employing now, in connection with this general case (105), the
same principle of “maximum likelihood” as before we see that,
in order for this probability to be a maximum,

$\sum_{r=1}^{\nu} \frac{(f''_r - f'_r)^2}{c_r^2}$ must be a minimum ....(106)


This is the simple fundamental condition of the Method of
Least Squares (see p. 170; A; 16). The name arises, obviously,
from the fact that the condition (106) requires that the least
possible value be assigned to the sum of the squares of the
deviations between the fitted and observed values, each divided by the
square of the appropriate parent parameter $c_r$.


If, now, $\frac{1}{c_r^2}$ be called the weight appropriate to the observed
$f'_r$, and if it be symbolized by $W_r$, we have

$W_r = \frac{1}{c_r^2}$ ....(107)

It also follows from (18) or (19), since $c$, $\eta$, $\sigma$, and $\lambda$ are all
proportionate, and because any constant factor can be omitted from the
minimization of (106), that in practice we may take for $W_r$
(in addition to $\frac{1}{c_r^2}$ or $\frac{1}{\eta_r^2}$) the very usual forms (as they are often
stated, and where $\lambda$ denotes the “probable error”)

$W_r = \frac{1}{\sigma_r^2}$ or $\frac{1}{\lambda_r^2}$ ....(108)

The principle of Least Squares consequently may be
formulated in the simple statement that

$\sum_{r=1}^{\nu} [W_r (f''_r - f'_r)^2]$ must be a minimum ....(109)


The "Normal" Equations 

When, therefore, the fitted values are to be those given by
a curve $y_x = f''(x; \alpha, \beta, \gamma, \ldots)$, in which the parameters $\alpha, \beta, \gamma, \ldots$
have to be determined, the condition (109) means that
$\sum [W_x \{f''(x; \alpha, \beta, \gamma, \ldots) - f'_x\}^2]$ must be minimized (where now
the more usual variable $x$ is written instead of $r$, and it will
henceforth be understood that the $\sum$ indicates summation over all the
values of $x$ which are included in the data). This evidently will
be effected (cf. 1?:61 :118, footnote) if we equate to zero the partial
differential coefficients with respect to the unknowns $\alpha, \beta, \gamma, \ldots$,
and solve the resulting equations, which then will clearly be the
same in number as the unknowns.

The application of this very simple and logical principle may
be illustrated first for the fitting of a general equation of the $n$th
degree, $y_x = \alpha + \beta x + \gamma x^2 + \ldots$. We then have to minimize
$\sum [W_x \{(\alpha + \beta x + \gamma x^2 + \ldots) - f'_x\}^2]$. The partial differential
coefficients with regard to $\alpha, \beta, \gamma, \ldots$ respectively, when equated to
zero, give immediately (the common factor 2 being omitted)

$\sum [W_x \{(\alpha + \beta x + \gamma x^2 + \ldots) - f'_x\}] = 0$
$\sum [x W_x \{(\alpha + \beta x + \gamma x^2 + \ldots) - f'_x\}] = 0$ ....(110)
$\sum [x^2 W_x \{(\alpha + \beta x + \gamma x^2 + \ldots) - f'_x\}] = 0$
etc.

These Normal Equations, as they are called, are in this case
simultaneous equations linear in $\alpha, \beta, \gamma, \ldots$, which therefore can
be solved easily. It will be seen that they can be written down
at once by a rule which is often stated in the following terms:
“Set down the ‘observation equation’ $(\alpha + \beta x + \gamma x^2 + \ldots) - f'_x = 0$
for each value of $x$, noting its weight. Form the normal equation
for the unknown $\alpha$ by multiplying each observation equation by
the coefficient of $\alpha$ in that equation, and also by its weight, and
adding the results; similarly form the normal equation for $\beta$ by
multiplying each observation equation by the coefficient of $\beta$ in
that equation, and also by its weight, and adding the results; and
likewise form a normal equation for each of the other unknowns,
$\gamma$, etc. The solution of these normal equations in the usual
manner will give the ‘best’ values for the unknowns $\alpha, \beta, \gamma, \ldots$”
(see also p. 322; C; 20).
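The rule just quoted can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production routine: the helper names `solve_linear` and `fit_polynomial`, the data, and the weights are all hypothetical. The matrix entry for normal equation $n$ and unknown $k$ is $\sum W_x x^{n+k}$, and the right-hand side is $\sum W_x x^n f'_x$, which is simply (110) rearranged.

```python
def solve_linear(M, v):
    """Solve M a = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol

def fit_polynomial(xs, observed, weights, degree):
    """Return [alpha, beta, gamma, ...] from the weighted normal equations."""
    m = degree + 1
    # Left-hand side: sum of W_x * x^(n+k); right-hand side: sum of W_x * x^n * f'_x.
    M = [[sum(w * x ** (n + k) for x, w in zip(xs, weights)) for k in range(m)]
         for n in range(m)]
    v = [sum(w * x ** n * f for x, w, f in zip(xs, weights, observed))
         for n in range(m)]
    return solve_linear(M, v)

xs = [0, 1, 2, 3, 4]
observed = [1 + 2 * x + 3 * x * x for x in xs]   # exactly quadratic, for checking
weights = [1.0, 2.0, 1.0, 3.0, 1.0]              # arbitrary illustrative weights
alpha, beta, gamma = fit_polynomial(xs, observed, weights, 2)
```

Because the illustrative data lie exactly on a parabola, any positive weights recover the same constants; with irregular data the weights would change the answer.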

This rule for the formation of the normal equations in respect
of a parabolic function $y_x = \alpha + \beta x + \gamma x^2 + \ldots$ is very simple, and
leads to normal equations which can be solved immediately since
they are linear with respect to the unknowns $\alpha, \beta, \gamma, \ldots$. It has
assumed much importance in the practical application of the
method of least squares, because it has also been employed widely
in a method of successive approximation for cases when the curve
to be fitted is not parabolic and the resulting normal equations
are not linear with respect to the unknowns. Under these
circumstances the classical method is to find, firstly, approximate values,
say, $\alpha', \beta', \ldots$, of the unknowns $\alpha, \beta, \ldots$, and then to suppose
that $\alpha = \alpha' + \delta\alpha$, $\beta = \beta' + \delta\beta, \ldots$, where the corrections $\delta\alpha, \delta\beta, \ldots$
which are now to be determined may be assumed to be so small
that their squares and higher powers may be neglected.
Expansion by Taylor’s Theorem thereupon immediately reduces the
procedure to that already given, with the small corrections
$\delta\alpha, \delta\beta, \ldots$ appearing as the unknowns in linear normal equations
(see p. 241; B; 22).
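A minimal sketch of this successive approximation follows, for the two-constant curve $y_x = Bc^x$. The function name `correction_step`, the synthetic data, the uniform weights, and the starting values are assumptions made for illustration. Each round linearizes the curve about the current approximate values and solves the two linear normal equations for the corrections.

```python
def correction_step(xs, observed, weights, B, c):
    """One round of the linearized normal equations for y_x = B * c**x.

    Returns the small corrections (dB, dc) to the approximate values (B, c).
    """
    s_bb = s_bc = s_cc = t_b = t_c = 0.0
    for x, f, w in zip(xs, observed, weights):
        d_b = c ** x                 # partial derivative of y_x with respect to B
        d_c = B * x * c ** (x - 1)   # partial derivative of y_x with respect to c
        resid = f - B * c ** x       # observed minus current fitted value
        s_bb += w * d_b * d_b
        s_bc += w * d_b * d_c
        s_cc += w * d_c * d_c
        t_b += w * d_b * resid
        t_c += w * d_c * resid
    # Solve the 2 x 2 normal equations by Cramer's rule.
    det = s_bb * s_cc - s_bc * s_bc
    dB = (t_b * s_cc - t_c * s_bc) / det
    dc = (s_bb * t_c - s_bc * t_b) / det
    return dB, dc

xs = list(range(0, 11))
observed = [1.0 * 1.1 ** x for x in xs]   # synthetic data with B = 1.0 and c = 1.1
weights = [1.0] * len(xs)                 # uniform weights, for illustration only
B, c = 1.1, 1.08                          # rough first approximations
for _ in range(20):
    dB, dc = correction_step(xs, observed, weights, B, c)
    B, c = B + dB, c + dc
```

The loop repeats the correction, as the text contemplates when the first corrections are not sufficient; convergence depends on the first approximations being reasonably close.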

This method of approximation could, of course, be repeated
until a satisfactory fit is obtained if it were found that the
approximate values $\alpha', \beta', \ldots$, with their first corrections $\delta\alpha, \delta\beta, \ldots$,
did not provide sufficiently good results. The numerical work
involved, however, is considerable even when the corrections are
determined adequately at the first attempt. In many cases it is
therefore preferable to use other devices, such as those stated
for Makeham’s formula and the “logistic” curve at p. 325; C; 21,
where the practical application of the method of least squares is
discussed.

The Weights 

In the preceding statement of the principle of least squares
it must be noted that the assignment of proper values to the
weights, $W_x$, is of essential importance. Being defined as $\frac{1}{c_x^2}$,
or $\frac{1}{\sigma_x^2}$, or $\frac{1}{\lambda_x^2}$, by (107) and (108), they will be taken in practice
in accordance with the requirements of the mathematical model
which is in fact being used.

Thus in a graduation of a series of observed rates of mortality,
$q'_x$, by fitting a mathematical expression to the values of $q'_x$ by
the method of least squares, we see, as explained in (iii) on p. 274;
C; 7, that $\sigma^2\{q'_x\} = \frac{p_x q_x}{E_x}$; writing therefore $W\{q'_x\}$, as in C; 7, to
denote the weight of the observed $q'_x$, it follows from (108) that

$W\{q'_x\} = \frac{E_x}{p_x q_x}$ ....(111)


Similarly, as in (iv) of C; 7, and since 

^x-¥\ ^ ^xQ.x 

ntzil^nix) ' (rtixy 


WU+i}^W{m',,} = 


....(112) 


Also, by (vi) of C; 7,

$W\{\operatorname{colog} p'_x\} = \frac{E_x p_x}{q_x}$ ....(113)

Instances of the practical use of these formulae are given at 
p. 326; C; 21. 


If, on the other hand, the method of least squares were being
applied to the fitting of a curve to a frequency distribution of
actual deaths, $\theta'_x$ (instead of to a graduation of ratios such as
$q'_x = \frac{\theta'_x}{E_x}$, etc.), it will likewise be seen, from (ii) of C; 7, that

$W\{\theta'_x\} = \frac{1}{E_x p_x q_x}$ ....(114)

Since $p_x \approx 1$ at most age groups, this formula (as noted also at
p. 293; C; 10) may sometimes be taken as

$W\{\theta'_x\} = \frac{1}{E_x q_x}$ ....(115)
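The weights for rates and for deaths can be computed as in the following minimal sketch. It assumes the binomial model in which the variance of the observed deaths is $E_x p_x q_x$, so that the weight of the deaths is $1/(E_x p_x q_x)$ and the weight of the rate $q'_x = \theta'_x / E_x$ is $E_x/(p_x q_x)$. The function names and the numerical values are hypothetical.

```python
def weight_of_rate(exposed, q):
    """Weight of an observed rate q'_x under the binomial model: E_x / (p_x * q_x)."""
    p = 1.0 - q
    return exposed / (p * q)

def weight_of_deaths(exposed, q, approximate=False):
    """Weight of observed deaths: 1 / (E_x * p_x * q_x).

    With approximate=True, p_x is taken as 1, giving 1 / (E_x * q_x).
    """
    p = 1.0 if approximate else 1.0 - q
    return 1.0 / (exposed * p * q)

# Illustrative age group: 1,000 exposed to risk, true rate 0.02 (assumed values).
exposed, q = 1000.0, 0.02
w_rate = weight_of_rate(exposed, q)
w_deaths = weight_of_deaths(exposed, q)
w_deaths_approx = weight_of_deaths(exposed, q, approximate=True)
```

Since the rate is the deaths divided by the exposure, its weight is $E_x^2$ times the weight of the deaths, which gives a simple check on the two functions.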


In certain cases, such as the fitting of an exponential $y_x = e^{f(x)}$
to an observed series $f'_x$, it is obvious that the problem may be
reducible to an easier form by taking logarithms and so fitting
$\log_e y_x = f(x)$ to the series $\log_e f'_x$. There is one very important
matter, however, which is frequently overlooked in the use of this
device, but which must not be forgotten. For, from (33), with
the notation of p. 272; C; 7, $\sigma^2\{\log_e f'_x\} = \frac{\sigma^2\{f'_x\}}{(f_x)^2}$, whence
by (108) and similar notation for the weights,
$W\{\log_e f'_x\} = (f_x)^2 W\{f'_x\}$; that is to say, the weight of $\log_e f'_x$ must be taken
as $(f_x)^2$ times the weight of $f'_x$. Consequently, even if the weights
in the fitting of $y_x$ to $f'_x$ can be taken as the same throughout
and therefore all as unity and negligible, the weights to be
used in the fitting of $\log_e y_x$ to $\log_e f'_x$ will be $(f_x)^2$, and
therefore, not being uniform, must not be neglected (cf. P:;^8:140,
144-5, and p. 326; C; 21 here).
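The adjustment of weights under a logarithmic transformation can be sketched as follows for the simplest case, a straight line fitted to the logarithms. The function name `fit_log_linear` and the synthetic data are assumptions; in particular, the observed values are used in place of the unknown true values when forming the factor $(f_x)^2$, which is a practical stand-in and not part of the theory.

```python
import math

def fit_log_linear(xs, observed, weights_of_f):
    """Fit log y_x = a + b*x to log f'_x, weighting each term by (f_x)^2 * W{f'_x}."""
    logs = [math.log(f) for f in observed]
    # Assumption: the observed value stands in for the unknown true f_x.
    w = [f * f * wf for f, wf in zip(observed, weights_of_f)]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, logs))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, logs))
    # Weighted normal equations for a straight line, solved directly.
    b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    a = (sy - b * sx) / sw
    return a, b

xs = [0, 1, 2, 3, 4, 5]
observed = [2.0 * math.exp(0.3 * x) for x in xs]   # synthetic exponential data
a, b = fit_log_linear(xs, observed, [1.0] * len(xs))
```

Even though the weights of the original values are uniform here, the weights actually applied to the logarithms increase with the size of the ordinate.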

In all these formulae for weights the undashed symbols indi- 
cate that the functions are, strictly, the “true” values of the 
parent population (as pointed out in C ; 7 — which conforms with 
the rationale of the fitting process explained at the commence- 
ment of this chapter, and with the consequent statement in 
connection with (107) that Cr is there the parameter of the 
appropriate parent population). For practical purposes these 
“true” values may be taken as the approximately adjusted values 
given by some simple method of preliminary graduation (cf. 
P:5f:37 and P:1S5).

The proper introduction of these weights is of great theoret- 
ical and practical importance, for the whole foundation of the 
method of least squares rests — as has been shown — upon the 
condition (109) of which the weights form an essential part. 
They should not be assumed to be of the same value for the whole 
range of x (and therefore constant and negligible in the normal 
equations) until proper investigation has shown that assumption 
of uniformity to be justifiable. 

In those fields where the curves to be fitted lead to normal 
equations which are not linear with respect to the unknowns, a 
great amount of criticism has been heaped upon the method of 
least squares because of the necessity of then using either the 
method of approximation outlined, or some other device which 
will bring the problem to a linear form. That the classical 
method of approximation may require heavy arithmetical work
is undoubtedly true, though with present-day modes of
computation this objection is not as serious as might appear. There
would seem, however, to be very little justification for the extent
to which the method of least squares has at times almost been
shunned (see, for example, P:55:255, and, per contra, cf. ^:28A2).
As will be pointed out later, it is often a very satisfying method
in comparison with the “method of moments”, so long as an
appropriate device is applied to deal with any constants which 
may be involved non-linearly in the normal equations. This 
study will therefore emphasize both the theoretical justification 
and the practicability of the method of least squares. Further- 
more, it should be remembered, quite apart from its evolution 
from the theory of errors embodied in the Normal Curve, that 
the principle of minimizing the sum of the squares of the devia- 
tions (duly prepared by a system of weighting) must evidently 
be expected to produce very good results, since the squaring of 
the deviations gives the same influence to positive and negative 
departures of equal amount, and a large error has a greater effect 
than a small one (cf. P:51:119). 

The Method of Moments 

It has already been seen that the fundamental condition (109)
of the method of least squares, namely that $\sum_{r=1}^{\nu} [W_r (f''_r - f'_r)^2]$
must be a minimum, will be satisfied when the partial differential
coefficients with respect to the unknowns in $f''_r$ are equated to
zero. This means that

$\sum_{r=1}^{\nu} \left[ W_r (f''_r - f'_r) \frac{\partial f''_r}{\partial a_n} \right] = 0$ ....(116)

where $\frac{\partial f''_r}{\partial a_n}$ represents the partial differential coefficient with
respect to the unknown $a_n$ ($n = 0, 1, \ldots$) of the curve which is to
be fitted, so that $a_0, a_1, \ldots$ are the several unknowns, and there
is one “normal” equation thus derived for each of them.

The application of this method to the general parabolic
function, $y_x = \alpha + \beta x + \gamma x^2 + \ldots$, has also been shown to lead
immediately to the set of normal equations (110), which, writing $\alpha = a_0$,
$\beta = a_1$, $\gamma = a_2, \ldots$, may be put shortly as

$\sum [x^n W_x \{(a_0 + a_1 x + a_2 x^2 + \ldots) - f'_x\}] = 0$ ....(117)

for $n = 0, 1, 2, \ldots$. These equations are evidently based on the
principle of moments, by which the observation equations, duly
weighted, are simply multiplied by the successive powers of the
variable, i.e., by $1, x, x^2, \ldots$ (cf. p. 253; B; 27).

It is thus clear that, in the case of a parabolic function of the
form $y_x = \alpha + \beta x + \gamma x^2 + \ldots$, the method of least squares leads to
exactly the same equations as the method of moments, so long
as both sets of equations are properly weighted in accordance with
the requirements of the method of least squares.


Reference has already been made to the fact that the normal
equations of least squares are usually difficult of solution in cases
other than that of the parabolic function just discussed.
Moreover, if the weights, $W_x$, can be taken throughout as uniform,
and therefore negligible, in the identical normal and moment
equations (117) of that case, we reach a simplified set from which
$W_x$ has disappeared, namely,

$\sum [x^n \{(a_0 + a_1 x + a_2 x^2 + \ldots) - f'_x\}] = 0$ ....(118)

for $n = 0, 1, 2, \ldots$. This unweighted procedure would mean, in
general, that

$\sum [x^n (f''_x - f'_x)] = 0$ ....(119)

In this unweighted form it has become known as the Method of
Moments, and is so used very widely as an easily applicable
process for the fitting of frequency curves and other formulae to
observed data.
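The unweighted condition (119) can be illustrated for the simplest curve, a straight line $a_0 + a_1 x$, by equating the zeroth and first moments of the fitted and observed values. This is a minimal sketch; the function name `fit_line_by_moments` and the data are hypothetical.

```python
def fit_line_by_moments(xs, observed):
    """Choose a0, a1 so that the sum of x^n * (fitted - observed) vanishes for n = 0, 1."""
    s0 = float(len(xs))
    s1 = float(sum(xs))
    s2 = float(sum(x * x for x in xs))
    m0 = float(sum(observed))                              # zeroth moment of the data
    m1 = float(sum(x * f for x, f in zip(xs, observed)))   # first moment of the data
    # Moment equations: a0*s0 + a1*s1 = m0 and a0*s1 + a1*s2 = m1.
    det = s0 * s2 - s1 * s1
    a0 = (m0 * s2 - m1 * s1) / det
    a1 = (s0 * m1 - s1 * m0) / det
    return a0, a1

xs = [1, 2, 3, 4, 5]
observed = [12.0, 15.0, 21.0, 22.0, 30.0]   # illustrative irregular data
a0, a1 = fit_line_by_moments(xs, observed)

fitted = [a0 + a1 * x for x in xs]
moment0 = sum(f2 - f1 for f2, f1 in zip(fitted, observed))
moment1 = sum(x * (f2 - f1) for x, f2, f1 in zip(xs, fitted, observed))
```

For a straight line these are also the unweighted least squares equations, in keeping with the identity between (110) and (117) when the weights are uniform.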


The relationships just shown between (i) the strictly weighted
equations of least squares, (ii) the weighted equations of
moments (identical with those of least squares in the case of a
parabolic function), and (iii) the unweighted equations of
moments, are highly important. They have been rather
seriously overlooked, however, on some occasions, with resulting
disparagement of the method of least squares, and corresponding
implications that unweighted moments may be applied with
assurance of success under almost any circumstances. It should
therefore be noted that, in general, the strictly weighted
equations (116) of least squares will be reproduced by the unweighted
equations (119) of moments when, for $n = 0, 1, 2, \ldots$,

$W_x \frac{\partial f''_x}{\partial a_n} = x^n$ ....(120)
This relation has been examined by Steffensen in a valuable
paper (P:i57:357), and in effect also by Hardy in :129. It is
there pointed out that if, as is generally possible, an exponential
function $f''_x = e^{a_0 + a_1 x + a_2 x^2 + \ldots}$ be used to represent
approximately a frequency distribution of observed values $f'_x$, for
which we can take $W_x = \frac{1}{f_x}$ (see p. 242; B; 23), then the weighted
equations of least squares will lead to practically the same results
as the unweighted equations of moments (loc. cit.). The meaning
of this important conclusion is well expressed by Steffensen
(P:iS7:358) in the following words: “Most frequency-curves can
more or less approximately be represented by an expression of
the form $f''_x = e^{a_0 + a_1 x + a_2 x^2 + \ldots}$. This is probably the true
reason why the method of moments has proved such a powerful
instrument for determining the constants of frequency-curves.
But the same consideration is a warning against using that
method indiscriminately, with or without weights, outside its
natural scope. The cases where it is said that the weighted
method of moments (yet with incorrect weights) has been
successfully applied to curves (such as the force of mortality) which
are not frequency-curves, will, on inspection, dissolve into cases
where some artifice has been brought in (such as working directly
on the exposures and deaths) by which the problem has been
transformed into that of applying the unweighted method of
moments to a frequency-curve. Hence the success.”
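The reason an exponential curve with weights $1/f_x$ brings the two sets of equations together can be set out in a short worked derivation. It is a sketch under the stated assumptions: the fitted values are close to the parent values, so that $f''_x / f_x \approx 1$.

```latex
\begin{align*}
f''_x &= e^{a_0 + a_1 x + a_2 x^2 + \ldots}
  &&\text{(assumed form of the fitted curve)} \\
\frac{\partial f''_x}{\partial a_n} &= x^n f''_x
  &&\text{(differentiating the exponent)} \\
W_x \frac{\partial f''_x}{\partial a_n} &= \frac{x^n f''_x}{f_x} \approx x^n
  &&\text{(taking } W_x = 1/f_x \text{ and } f''_x \approx f_x\text{)} \\
\sum_x W_x (f''_x - f'_x) \frac{\partial f''_x}{\partial a_n}
  &\approx \sum_x x^n (f''_x - f'_x)
  &&\text{(weighted least squares reduces to unweighted moments)}
\end{align*}
```

For a curve which is not of this exponential type, or for weights other than $1/f_x$, the third line fails and the two methods need not agree.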

These observations emphasize in an important manner the 
distinction which should be kept in mind between the weighted 
and unweighted applications of the method of moments which 
have appeared in practice. From the preceding discussion it is 
to be anticipated, evidently, that (1) the strictly weighted least
square equations will give a “best” fit according to the particular
(and very defensible) definition of “best” involved therein; (2) 
approximately weighted least square equations will lead to very 
good results (since the weights are in reality only relative values) ; 
(3) strictly weighted, or even approximately weighted, equations 
based on moments will produce values closely comparable to those 
of least squares for parabolic functions; (4) the weighted equa- 
tions of least squares and the unweighted equations of moments 
may be expected to give fairly similar results only in certain cases 
such as those which are in reality the representation of a fre- 
quency distribution ; but that (5) unweighted least squares should 
be applied only under circumstances which appear to be justi- 
fiable after due examination in accordance with the principles on 
which the use of weights is based; and (6) the unweighted equa- 
tions of moments should not be accepted as being necessarily 
or universally applicable, and therefore should be employed with 
some caution. 

A brief indication of the practical significance of these prin- 
ciples is given at p. 328; C; 22, in respect of certain applications 
of the method of moments in actuarial work. References to 
numerical examples of the fitting of curves by the method of 
moments may also be found at pp. 256-257; B; 27, and pp. 312- 
319; C; 17. 

The Minimum-$\chi^2$ Principle

The rationale of the method of least squares is based, in fact,
on a mathematical model which contemplates (as explained in
the first paragraph of Chapter VI) that each term of the series
to be dealt with, namely, $f'_1, \ldots, f'_\nu$, is independent of every other
term, so that each deviation, $f''_r - f'_r$, between the fitted and
observed values (and each corresponding parent value, $f_r$) is to be
treated separately as conforming with the concepts of the
binomial case represented by the Normal Law of Deviations (10).

As shown in Chapter VI, however, this binomial model may
be extended easily in order to deal with all the terms from $f'_1$ to $f'_\nu$
at once by means of the Multinomial Normal Law of Deviations
(50). Under these circumstances it is evident from (62) that the
probability for the observed series, $f'_1, \ldots, f'_\nu$, when the parent
population is specified by the “true” values $p_1, \ldots, p_\nu$ in respect
of a total of $N$, is a maximum when

$\chi^2 = \sum_{r=1}^{\nu} \frac{(f'_r - Np_r)^2}{Np_r}$

is a minimum. Similarly, if it be assumed (as in the previous
discussions of the methods of maximum likelihood, and of least
squares) that a curve is being fitted which is of appropriate form,
so that the fitted values ($f''_r$) will be estimates of the true parent
values ($f_r$), then the observed series $f'_1, \ldots, f'_\nu$ may be viewed as
having been drawn as a sample from a fitted series $f''_1, \ldots, f''_\nu$,
and the preceding condition for the maximum probability
becomes the condition that

$\chi^2 = \sum_{r=1}^{\nu} \frac{(f'_r - f''_r)^2}{f''_r}$ must be a minimum ....(121)


The application of this method to graduations which arise in
actuarial work has been discussed by Cramér and Wold in
P:«S:172.


The relation between this principle of minimizing $\chi^2$, and the
least squares condition (109), namely that
$\sum_{r=1}^{\nu} [W_r (f'_r - \mathring{f}_r)^2]$ must be a minimum,
will be seen readily. For this least squares condition requires the
use of the weights $W_r$; in the case of the fitting of a curve to a
frequency distribution it is pointed out at p. 242; B; 23 that $W_r$
may be taken roughly as $\frac{1}{\mathring{f}_r}$; for a frequency
distribution, therefore (being the case dealt with by the
mathematical model of the Multinomial Law of Deviations), the
method of least squares may be applied approximately by minimizing

$$\sum_{r=1}^{\nu} \frac{(f'_r - \mathring{f}_r)^2}{\mathring{f}_r},$$

and this is precisely the minimum-$\chi^2$ principle just stated. It
may therefore be anticipated that the strict equations of least
squares, and the minimum-$\chi^2$ method, would produce closely
comparable results in the case of a frequency distribution (see also
p. 244; B; 24).

The process of minimizing (121), however, is usually difficult,
for the unknown parameters in the fitted values $\mathring{f}_r$ are
involved in the denominator as well as in the numerator, so that the
differentiations become very complex (cf. p. 243; B; 23). Cramér and
Wold, therefore, have suggested (P:i8S:173) that a close
approximation will evidently be obtained if the unknown
$\mathring{f}_r$ in the denominator be replaced by the observed
$f'_r$, for the $f'_r$ series is supposed to have been derived as a
random sample from $\mathring{f}_r$, and the method is based on the
assumption that the $\mathring{f}$ function to be fitted is of
appropriate form. The principle then would become that

$$\sum_{r=1}^{\nu} \frac{(f'_r - \mathring{f}_r)^2}{f'_r} \quad \text{must be a minimum}$$
....(122)

As a practical method of approximation this may be expected
to give results close to those of the method of least squares in the
case of the fitting of a curve to a frequency distribution, since
then $W_r \approx \frac{1}{\mathring{f}_r} \approx \frac{1}{f'_r}$, as
noted at p. 242; B; 23, and with these substitutions the strict least
squares condition (109) becomes the approximate minimum-$\chi^2$
condition (122). An application of the principle in this form is
noted at p. 331; C; 23.
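The difference between conditions (121) and (122) can be seen in a small numerical sketch. The Python below is an illustration only: the observed frequencies, the choice of a one-parameter geometric progression as the fitted curve, the grid-search minimizer, and the names `fitted`, `chi2_strict`, `chi2_modified`, and `grid_minimize` are all assumptions made for this example and are not taken from the text.

```python
# Hypothetical observed frequencies f'_r in five cells (illustrative data only).
observed = [52, 25, 12, 7, 4]
N = sum(observed)

def fitted(c):
    """Fitted frequencies: a geometric progression with ratio c, scaled to total N."""
    weights = [c ** r for r in range(len(observed))]
    total = sum(weights)
    return [N * w / total for w in weights]

def chi2_strict(c):
    """Condition (121): the fitted values appear in the denominator."""
    return sum((o - f) ** 2 / f for o, f in zip(observed, fitted(c)))

def chi2_modified(c):
    """Condition (122): the observed values replace the fitted ones in the denominator."""
    return sum((o - f) ** 2 / o for o, f in zip(observed, fitted(c)))

def grid_minimize(objective, lo=0.05, hi=0.95, steps=9000):
    """Return the grid point in [lo, hi] at which the objective is smallest."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=objective)

c_strict = grid_minimize(chi2_strict)
c_modified = grid_minimize(chi2_modified)
print(f"ratio from (121): {c_strict:.4f}, chi-square = {chi2_strict(c_strict):.3f}")
print(f"ratio from (122): {c_modified:.4f}, chi-square = {chi2_modified(c_modified):.3f}")
```

A grid search is used here only because it avoids the differentiation that makes (121) awkward; with several unknown parameters a proper numerical minimizer would be needed.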


Other Methods of Fitting Curves 

The methods of curve fitting which have been discussed in
the preceding paragraphs (namely, the methods of least squares,
moments, and minimum-$\chi^2$) all make some use, although
through slightly different formulations, of the whole of the
available information provided by the series of observed data, since
every value from $f'_1$ to $f'_\nu$ enters into the relations from
which the unknown parameters are eventually determined. Several other
processes which similarly employ the whole of the observed
material have been suggested; but as they are all based upon
principles less satisfying theoretically than those of least squares,
moments, or minimum-$\chi^2$, it will be sufficient only to note them,
as of historical interest, at p. 171; A; 17.

Other much simpler methods, furthermore, have sometimes
been employed, which base the equations for solution merely
upon isolated values of the observed data, in order to use only as
many equations as the number of unknown parameters. Evidently,
however, they discard entirely all the information which
might be derived from the data which are ignored, and thus
cannot be expected to give anything beyond roughly approximate
values of the parameters (cf. p. 173; A; 17).


Graphic methods which employ “semi-logarithmic” or
“double-logarithmic” paper may also permit simple procedures
in some cases, semi-logarithmic paper being ruled on the y-axis
according to the logarithms of the numbers (the x-axis being
spaced arithmetically), and double-logarithmic paper giving $\log x$
as well as $\log y$ (see P:;^7:224 and 237). Thus an exponential
curve, or Laplace's First Law of Error as at p. 159; A; 4, or a
geometrical progression $y = ar^x$, or Gompertz's formula (82) when
written $\mu_x = Bc^x$, are all of the form $\log y = K + Ax$ and
will consequently appear as straight lines on semi-logarithmic paper
ruled for $\log y$ and $x$. The parabola $y = ax^b$ when
written $\log y = \log a + b \log x$, and being thus of the form
$\log y = K + A \log x$, will similarly be represented by a straight
line on double-logarithmic paper ruled for $\log y$ and $\log x$. On
these principles Gerhard (P:4^) has given a very simple method for
Makeham's formula (83) in the form
$\operatorname{colog} p_x = \alpha + \beta c^x$; for the
second term is a geometrical progression which will be a straight
line on semi-logarithmic paper, so that, when observed values of
$\operatorname{colog} p_x$ are plotted as points on semi-logarithmic
paper, the fitting will be given when a straight line (representing
$\beta c^x$) can be located at a constant distance $\alpha$ below
those points. (See also WilSS).

Other graphical devices are noted at p. 173; A; 17. 



IX. THE TESTS OF GOODNESS OF FIT 


When statistical data, such as observed rates of mortality
at various ages $x$, have been subjected to “graduation” with the
objects and by the methods explained in Chapter VIII, it at once
becomes essential to determine whether such graduated values
give a proper representation of the material at hand.

As a practical device it may often be sufficient simply to
examine the differences between the data and the graduated
results, at individual ages, and in groups of ages, and for all ages
combined, with due regard to (i) the frequency of changes of
sign, and (ii) the standard deviations or “probable errors”.


(i) The Frequency of Changes of Sign 


Periodic Series. The changes of sign can be examined according
to the principle that, if the distribution of + and − signs in
$N$ terms of a periodic series (in which the first and last terms are
consecutive) has occurred merely by chance, then (a) the average
number of isolated sequences of $r$ signs alike will be
$\frac{N}{2^{r+1}}$; (b) the number of signs which fall within groups
of one or two like signs will be approximately equal to the number
which fall within groups of more than two; and (c) the average number
of sequences of all orders ($r = 1, 2, 3, \ldots$) will tend to the
limit $\frac{N}{2}$ (see p. 246; B; 25).

Non-Periodic Series. When a series, as in the case of rates of
mortality, is not periodic, the preceding rules may still be applied
so long as the first and last signs are treated as consecutive (so
that they will be considered as belonging to the same group if
they are alike).

If, however, the series is dealt with directly as being
non-periodic (without the first and last signs being placed in the
same group if they are alike), then the average number of isolated
sequences of $r$ signs alike becomes $\frac{N - r - 1}{2^{r+1}}$ when
the first and last signs are omitted (see p. 249; B; 25, and p. 332;
C; 24), or $\frac{N - r + 3}{2^{r+1}}$ if the first and last signs are
included (see p. 250; B; 25).
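The rules for sequences of like signs lend themselves to a short computation. The following Python sketch tabulates the observed sequences in a series of signs against the average numbers given by the formulae above; the sign series is invented for illustration, and the function names and the `periodic` and `include_ends` switches are assumptions of this example rather than notation from the text.

```python
from itertools import groupby

def expected_sequences(n, r, periodic=True, include_ends=True):
    """Average number of isolated sequences of exactly r like signs among n chance signs."""
    if periodic:
        return n / 2 ** (r + 1)
    if include_ends:
        return (n - r + 3) / 2 ** (r + 1)
    return (n - r - 1) / 2 ** (r + 1)

def observed_sequences(signs):
    """Count the sequences of each length in a non-periodic series of '+' and '-' signs."""
    counts = {}
    for _, group in groupby(signs):
        length = len(list(group))
        counts[length] = counts.get(length, 0) + 1
    return counts

# Hypothetical signs of the deviations (observed minus graduated) at 20 consecutive ages.
signs = "++-+--+-++-+---+-+-+"
counts = observed_sequences(signs)
for r in sorted(counts):
    average = expected_sequences(len(signs), r, periodic=False, include_ends=True)
    print(f"r = {r}: observed {counts[r]}, average by chance {average:.2f}")
```

Here the end sequences are counted, so the non-periodic formula with the first and last signs included is the one compared against.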


(ii) Standard Deviations or Probable Errors

The comparison in the light of the standard deviations or
probable errors is effected easily, on the assumption that the
graduated values, $\mathring{q}_x$, are estimates of the parent
values, $q_x$, by setting out the actual deviations between the
ungraduated and graduated values, and comparing them (so long as $q$
or $p$ is not so small that $nq$ or $np$ is less than about 10) with
those expected according to the criteria summarized at the end of
Chapter III. The actual deviations can then be regarded as being due
only to accidental fluctuations if they are within 2 (or 3) times the
standard deviations, or within 3 (or 4) times the probable errors;
when the actual deviations are within these limits, that is to say,
the graduated rates may be accepted as a permissible representation
of the ungraduated values. Since, however, the $\pm 3\sigma$ or
$\pm 4\lambda$ limits (where $\lambda$ denotes the probable error)
embrace over 99% of the chance deviations, and $\pm 2\sigma$ or
$\pm 3\lambda$ include practically 95%, it will be clear that such
comparisons are useful merely as a test of the hypothesis that the
graduated rates constitute an admissible representation; they do not
establish an admissible graduation as a good one, and certainly
not as a “best” one. It is therefore necessary to narrow the
limits so that, instead of examining the widest possible deviations,
we may test, for example, the average deviations which
would have arisen by chance alone. By formula (20), therefore,
a graduation may be considered satisfactory (not merely
permissible) if the actual deviations (irrespective of sign) are not
greater than approximately $\frac{4}{5}\sigma$. Illustrations of
these criteria are noted at p. 333; C; 24.

(iii) Comparisons of Financial Functions

Another practical and sometimes comprehensive test, of
value to the actuary in view of his responsibility for the financial
appropriateness of the results, is a comparison between the
ungraduated and graduated values of a financial function of the
basic rates, such as the annuity values

$$a_x = \sum_{t=1}^{\omega - x} v^t \, {}_tp_x,$$

where ${}_tp_x = p_x \, p_{x+1} \cdots p_{x+t-1}$, $p_x = 1 - q_x$,
and $v = \frac{1}{1+i}$, in which $i$ is the rate of interest and
$\omega$ is the final age.
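A comparison of this kind can be sketched in a few lines of Python. The rates below are invented, only five ages are used (so the sum is cut off after five terms rather than carried to the final age), and the function name `annuity` is an assumption of this illustration.

```python
def annuity(q, i):
    """Sum of v^t * tp_x over the ages supplied, where q[0] = q_x, q[1] = q_{x+1}, and so on."""
    v = 1.0 / (1.0 + i)
    survival = 1.0   # running product tp_x = p_x * p_{x+1} * ... * p_{x+t-1}
    total = 0.0
    for t, q_age in enumerate(q, start=1):
        survival *= 1.0 - q_age
        total += v ** t * survival
    return total

# Hypothetical ungraduated and graduated rates of mortality at five consecutive ages.
q_observed = [0.010, 0.013, 0.011, 0.016, 0.015]
q_graduated = [0.010, 0.011, 0.012, 0.014, 0.016]
a_obs = annuity(q_observed, 0.03)
a_grad = annuity(q_graduated, 0.03)
print(f"ungraduated: {a_obs:.4f}  graduated: {a_grad:.4f}  difference: {a_grad - a_obs:+.4f}")
```

Supplying the rates for every age up to the final age, with the last rate equal to unity, gives the whole-life annuity value of the formula above.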

The three preceding methods, however, practically useful and
often sufficient though they are, clearly proceed little further than
a mere comparison between the numerical observations as they
are given, and the graduated values obtained therefrom. They
afford criteria of the “admissibility” of the graduation, and of its
“satisfactory” nature for practical purposes; but they do not
reach the concept of a criterion for a “best” graduation, and they
do not attempt any formal recognition of the fundamental objective
of a graduation, namely, the determination of a hypothetical
population from which the data may be supposed to have been
drawn by chance. We shall therefore now consider the manner
in which a test may be based on (iv) the criterion of fit on which
the “best” graduation by least squares is founded, and (v) the
$\chi^2$ function as it emerges from the theory of determining the
hypothetical parent population.

(iv) The “Least Squares” Criterion

When an observed series, $f'_r$, for values of $r$ from 1 to $\nu$,
has been drawn from a parent series, $f_r$, the true error of each
observed term is $f_r - f'_r$. If a graduation were now made, and if
the graduated values, $\mathring{f}_r$, were a perfect representation
of the true values, $f_r$, then each true error, $f_r - f'_r$, would
be precisely equal to the corresponding residual,
$\mathring{f}_r - f'_r = v_r$. The expected square of each true error
(i.e., the mean square error) would thus be, a priori, $\sigma_r^2$,
whereas the equivalent squared residual, a posteriori, would be
$v_r^2$. The mean value of the ratio $\frac{v_r^2}{\sigma_r^2}$ would
therefore be unity; and for the mean values over the whole series of
$\nu$ terms we should consequently have
$\sum_{r=1}^{\nu} \left(\frac{v_r^2}{\sigma_r^2}\right) = \sum_{r=1}^{\nu} (1) = \nu$.
Since $\frac{1}{\sigma_r^2} = W_r$ and
$v_r^2 = (\mathring{f}_r - f'_r)^2$, this means that in a “perfect”
graduation we should have

$$\sum_{r=1}^{\nu} [W_r (\mathring{f}_r - f'_r)^2] = \nu$$
....(123)

In order to test the “goodness” of fit of a graduation which
is not “perfect” we may accordingly say that the condition (123)
should be satisfied approximately. It will be noted that the
relation $\frac{v_r^2}{\sigma_r^2} = 1$, or
$W_r (\mathring{f}_r - f'_r)^2 = 1$, expresses the fact that the
mathematical expectation of the weighted squared residual is
unity; and (123) states that the mathematical expectation of
$\sum_{r=1}^{\nu} [W_r (\mathring{f}_r - f'_r)^2]$, which is the
function (109) to be minimized by the method of least squares, is
equal to the number of terms, $\nu$ (see p. 262; B; 26 for this
terminology, and cf. P:;^S:178).


In the case of a mortality table it has been shown at p. 244;
B; 24 that for the least squares graduation of an observed rate
of mortality, $q'_x$, the function to be minimized is
$\sum \left[ \frac{E'_x (q'_x - \mathring{q}_x)^2}{p_x q_x} \right]$,
which is
$\sum \left[ \frac{(\theta'_x - \mathring{\theta}_x)^2}{E'_x p_x q_x} \right]$,
and that the same expression is to be minimized in a least squares
graduation of the observed deaths, $\theta'_x$. If, therefore, a
graduation of either $q'_x$ or $\theta'_x$ has been made, a measure of
the closeness of fit would be given by testing the agreement between
$\sum \left[ \frac{(\theta'_x - \mathring{\theta}_x)^2}{E'_x p_x q_x} \right]$
and the number of ages, $\nu$. Since the parent values $p_x$ and $q_x$
are usually not known, the computation might be made in practice by
using the observed $p'_x$ and $q'_x$ as estimates of $p_x$ and $q_x$
(as in H:45:336), or values obtained from an approximate preliminary
graduation (see p. 96 here), or the finally graduated
$\mathring{p}_x$ and $\mathring{q}_x$ themselves (as in H:45:323-325
and P:5S:178). This test of a mortality table graduation was
discussed and used extensively by De Forest as long ago as 1873
(see p. 175; A; 18).


The above process is founded on a comparison of the mean
square errors over the whole of the observed and graduated
series, and in dealing with the graduated series every term enters
into the calculations. When, however, the graduation is performed
by the least squares fitting of a curve in which there are,
say, $k$ unknowns, it may be argued (as in the analogous case of
Bessel's correction (42) for one unknown and uniform weights)
that in effect $k$ of the terms are fixed, so that only $\nu - k$ of
them are free. Under these circumstances, as is shown by (ii) at
p. 252; B; 26, the mean square error of an observation of unit weight
is $\frac{\sum_{r=1}^{\nu} [W_r (\mathring{f}_r - f'_r)^2]}{\nu - k}$,
so that then (123) becomes modified to

$$\sum_{r=1}^{\nu} [W_r (\mathring{f}_r - f'_r)^2] = \nu - k$$
....(124)

This principle has been used since the time of Gauss (see p. 175;
A; 18). While it recognizes, in effect, the concept of “degrees of
freedom”, it should be noted that, like (42), its derivation is
based on the Principle of Insufficient Reason, or alternatively,
upon other assumptions which are open to some argument (see
p. 252; B; 26).
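For a mortality table the comparison of the weighted sum of squared residuals with $\nu$ in (123), or with $\nu - k$ in (124), might be sketched as follows. The exposures, deaths, graduated rates, and the value of `k` are invented for illustration; the graduated rates are used in place of the unknown parent values, which is one of the practical choices mentioned above, and the function name is an assumption of this example.

```python
def weighted_squared_residuals(exposed, deaths_observed, q_graduated):
    """Sum over ages of (observed deaths - E*q)^2 / (E*p*q), with graduated q for the parent q."""
    total = 0.0
    for e, theta_obs, q in zip(exposed, deaths_observed, q_graduated):
        theta_grad = e * q                       # graduated (expected) deaths
        total += (theta_obs - theta_grad) ** 2 / (e * q * (1.0 - q))
    return total

# Hypothetical exposures, observed deaths and graduated rates at six ages.
exposed = [12000, 11500, 11000, 10400, 9800, 9100]
deaths = [118, 131, 128, 152, 149, 171]
q_grad = [0.0100, 0.0110, 0.0121, 0.0135, 0.0155, 0.0180]
k = 2   # assumed number of unknowns in the fitted curve

s = weighted_squared_residuals(exposed, deaths, q_grad)
nu = len(exposed)
print(f"sum = {s:.2f}; compare with nu = {nu} as in (123) and nu - k = {nu - k} as in (124)")
```

With so few ages the comparison is only indicative; the sketch shows the arithmetic, not a judgment on any real graduation.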

In the case of a least squares graduation of $q'_x$ or $\theta'_x$
this relation for goodness of fit means that

$$\sum \left[ \frac{(\theta'_x - \mathring{\theta}_x)^2}{E'_x p_x q_x} \right] = \nu - k$$
....(125)

where again in practice $p_x$ and $q_x$ would be taken as $p'_x$ and
$q'_x$, or as approximately graduated values, or as $\mathring{p}_x$
and $\mathring{q}_x$. This test was first employed by Thiele in 1871
(see p. 175; A; 18), with $\mathring{p}_x$ and $\mathring{q}_x$ being
used for $p_x$ and $q_x$, so that Thiele's form was

$$\sum \left[ \frac{(\theta'_x - \mathring{\theta}_x)^2}{E'_x \mathring{p}_x \mathring{q}_x} \right] = \nu - k$$
....(126)

(v) The $\chi^2$ Test

It has already been explained in Chapter VI that an observed
series $f'_1, f'_2, \ldots, f'_\nu$, totalling $N$, may be regarded as
having fallen into the $\nu$ “cells” through the operation of the true
“parent” probabilities $p_1, p_2, \ldots, p_\nu$ in accordance with
the Multinomial Normal Law (52), that is,

$$\frac{1}{(\sqrt{2\pi N})^{\nu-1} \sqrt{p_1 p_2 \cdots p_\nu}} \, e^{-\frac{1}{2}\chi^2}, \quad \text{where} \quad \chi^2 = \sum_{r=1}^{\nu} \frac{(f'_r - Np_r)^2}{Np_r};$$

and it was also pointed out that there are under those circumstances
$\nu - 1$ “degrees of freedom”, since one linear “constraint” upon the
possible values is imposed by the condition that
$f'_1 + f'_2 + \ldots + f'_\nu = N$. It was further shown that
the probability of getting a series with $\nu - 1$ of its values
exhibiting deviations simultaneously lying between $x_r\sqrt{N}$ and
$(x_r + dx_r)\sqrt{N}$ (where $r = 1, 2, \ldots, \nu - 1$), with the
$\nu$th fixed by the “constraint”, could be expressed as the multiple
integral (54), and that both formulae are merely extensions of the
corresponding (10) and (11) for the binomial case when $N = n$,
$\nu = 2$, $p_1 = p$, and $p_2 = q$. Since the process by which the
multinomial expressions (52) and (54) were reached is thus a
straightforward generalization of the binomial formulation, it may be
of assistance again here to return to the binomial for the start of
the development in order to establish clearly the basic principles of
the $\chi^2$ Test for Goodness of Fit.

It may therefore be recalled that in the binomial case it was
shown (see p. 56) that when the deviation is any quantity $x$, then
$\chi^2 = \frac{x^2}{npq}$. Suppose, therefore, that a particular
observation has shown a deviation, $\epsilon$, for which $\chi^2$ has
the particular value $\chi_0^2$. This may be considered as an
improbable (i.e., “poor”) observation if in a large proportion of
other similar observations the deviations arising from chance alone
would be smaller than $\epsilon$, so that $\chi^2$ would be less than
$\chi_0^2$. But by (10), as shown on p. 162; A; 5, the probability of
a deviation less than $\epsilon$, in absolute magnitude, i.e., lying
between $-\epsilon$ and $+\epsilon$, is

$$\frac{1}{\sqrt{2\pi npq}} \int_{-\epsilon}^{+\epsilon} e^{-\frac{x^2}{2npq}} \, dx.$$

The region of integration here is for values of
$x^2 \leq \epsilon^2$, which means values of
$\frac{x^2}{npq} \leq \frac{\epsilon^2}{npq}$, that is, values of
$\chi^2 \leq \chi_0^2$, $\chi^2$ being values of $\frac{x^2}{npq}$.

Now it has been seen in (63) that this integral may be written
$\frac{1}{\sqrt{2\pi npq}} \int e^{-\frac{1}{2}\chi^2} \, dx$, where
the limits may still be defined as the region for which
$\chi^2 \leq \chi_0^2$. Alternatively, by putting
$\frac{x^2}{npq} = v^2$, say, that same integral becomes
$\frac{1}{\sqrt{2\pi}} \int e^{-\frac{1}{2}v^2} \, dv$, where
the integration is again over the same region, which is here that
for which $v^2 \leq \chi_0^2$.

It therefore follows (P:f 4^:326) that in the multinomial case
the multiple integral (54), which corresponds to the binomial
(53), can be written

$$\frac{1}{(\sqrt{2\pi})^{\nu-1}} \int \cdots \int e^{-\frac{1}{2}(v_1^2 + v_2^2 + \ldots + v_{\nu-1}^2)} \, dv_1 \, dv_2 \cdots dv_{\nu-1},$$

where the domain of integration extends over all the values for
which $v_1^2 + v_2^2 + \ldots + v_{\nu-1}^2 \leq \chi_0^2$; and this
will give the probability of getting, by chance alone, a set of
deviations which will produce a value of $\chi^2$ equal to or less
than the $\chi_0^2$ actually observed.

The reduction of this multiple integral is heavy but not difficult.
The proof is too long for inclusion here; it is given excellently,
however, in P:14e:331 and P:£l :II, 302, and leads to the
simple integral

$$\frac{\pi^{\frac{\nu-1}{2}}}{\Gamma\left(\frac{\nu-1}{2}\right)} \cdot \frac{1}{(\sqrt{2\pi})^{\nu-1}} \int_0^{\chi_0^2} u^{\frac{\nu-3}{2}} e^{-\frac{u}{2}} \, du,$$

in which the “Gamma Function” is defined as stated on p. 269;
B; 29. Putting $u = z^2$, this becomes

$$\frac{1}{2^{\frac{\nu-3}{2}} \, \Gamma\left(\frac{\nu-1}{2}\right)} \int_0^{\chi_0} e^{-\frac{z^2}{2}} z^{\nu-2} \, dz.$$

As already noted (and explained in Chapter VI) the $\nu$ variables
here are subject to one “linear constraint” (imposed by the
condition that $\sum f'_r = N$), so that there are $\nu - 1$ “degrees
of freedom”. Writing $\nu - 1 = d$ to represent the degrees of
freedom, the expression therefore is

$$\frac{1}{2^{\frac{d-2}{2}} \, \Gamma\left(\frac{d}{2}\right)} \int_0^{\chi_0} e^{-\frac{z^2}{2}} z^{d-1} \, dz.$$

This formula, however, measures the probability of getting
by chance alone a value of $\chi^2$ as small as, or smaller than, the
$\chi_0^2$ actually given by the observations, and is thus based on an
appraisal of the poorness of the result. The probability of the
opposite inequality, $\chi^2 \geq \chi_0^2$, will accordingly measure
the “goodness of fit”. Identifying that measure by $P$ we therefore
have

$$P = \frac{1}{2^{\frac{d-2}{2}} \, \Gamma\left(\frac{d}{2}\right)} \int_{\chi_0}^{\infty} e^{-\frac{z^2}{2}} z^{d-1} \, dz$$
....(127)

where $d$ is the number of degrees of freedom.


In the preceding deduction there are $\nu - 1$ degrees of freedom,
being the number of the variables, $\nu$, less one constraint (imposed
by the requirement that the total frequencies falling into the
$\nu$ cells shall be $N$). Similarly, if there are, in any particular
problem, further conditions which restrict the values which can be
assigned at will (as, for example, an additional requirement that
the mean, as well as the total, of the frequencies shall be equal
in the observed and the parent series), the degrees of freedom will
be reduced by one for each such condition. The degrees of
freedom, $d$, that is to say, will be taken as $\nu - k$, where $\nu$
is the number of cells and $k$ the number of constraints.
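Where printed tables of $P$ are not at hand, the value given by (127) can be computed directly. The Python sketch below is one way of doing so under stated assumptions: it rewrites (127) as one minus a regularized incomplete gamma function of $\chi_0^2/2$ and evaluates that by its power series, which is an implementation choice of this example and not a method described in the text. The cell frequencies and the function names are likewise hypothetical.

```python
import math

def chi_square(observed, expected):
    """Chi-square of observed cell frequencies against expected (parent or fitted) ones."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def goodness_of_fit_p(chi2_0, d):
    """P of (127): the chance of a chi-square at least as large as chi2_0 with d degrees of freedom.

    Evaluated as 1 minus the series for the lower incomplete gamma function,
    which is adequate for moderate chi2_0 but loses accuracy far out in the tail.
    """
    if chi2_0 <= 0:
        return 1.0
    a, x = d / 2.0, chi2_0 / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Hypothetical example: four cells, with only the total constrained, so d = 4 - 1.
observed = [18, 30, 27, 25]
expected = [25.0, 25.0, 25.0, 25.0]
chi2_0 = chi_square(observed, expected)
d = len(observed) - 1
print(f"chi-square = {chi2_0:.2f}, d = {d}, P = {goodness_of_fit_p(chi2_0, d):.3f}")
```

If further constraints were imposed in fitting, `d` would be reduced by one for each, as explained above.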


In order to apply this formula it is therefore necessary to
compute $\chi_0^2$ from the data, to settle the number of degrees of
freedom $d$, to determine the numerical value of $P$ (which may
be done easily from tables which have been prepared), and then
to interpret the result. The history of this $\chi^2$ Test (or
“Chi-Squared Test”), as it is usually called, and the available
tables of $P$, are recorded at p. 175; A; 19.

The following precautions are imposed, by the preceding
theory, upon the calculation and interpretation of $P$: (a) As
already noted in the derivation of the Multinomial Normal Law
of Deviations (50), the total $N$ must be fairly large (probably
at least 50), and no value of $Np_r$ should be less than about 10,
the latter restriction being surmountable in practice by
amalgamating the values of one or more adjacent cells, where
necessary, in order to produce a frequency not less than 10; (b) Since
$P$ gives the probability, under random sampling, of getting a
value of $\chi^2$ equal to or greater than the observed $\chi_0^2$, it
is to be concluded, when $P$ is small, that the observed $\chi_0^2$
may have arisen from significant causes other than the merely chance
variations of random sampling; (c) When $P$ is large, however, the
inference cannot be drawn that the observed $\chi_0^2$ has arisen from
random sampling alone; the proper inference is the negative one only,
namely, that the existence of significant causes has not been
proved; (d) Moreover, if $P$ is found to be very close to unity,
the result must be viewed with suspicion, because so large a
value of $P$ will ordinarily have arisen from a value of $\chi_0^2$ so
small as to raise doubts concerning the sampling technique employed.

Now it is to be remembered that a value $P = .05$, for example,
means that only in 5% of trials should we obtain a value of $\chi^2$
as large as or larger than the observed $\chi_0^2$. Any smaller value
of $P$, accordingly, indicates an even smaller percentage of trials.
It is, of course, a matter of individual preference to select a value
of $P$ at or below which all values would be considered small, in
the sense that the investigator would view them as undoubtedly
significant. The values $P = .05$, $P = .01$, and $P = .001$ thus
chosen are often said to define the 5%, 1%, and .1% levels of
significance, and smaller values than those designated are spoken of
as lying below the particular level used. While judgment must
naturally enter into the decision to be made in any particular
case, experience with the practical use of the $\chi^2$ test has led
to the following suggestions which have been widely adopted:
(i) If $P$ is found to lie between .1 and .9, there is no reason to
suppose that significant causes have produced the value $\chi_0^2$
observed; (ii) If, however, $P$ is less than .05, there is good reason
to conclude that a real discrepancy exists between the observed
and theoretical values, and if $P$ is below .02 such an indication
is strong; (iii) In testing the “fit” of a curve (i.e., in comparing
observed and theoretical frequencies) it will generally be found
that, when the data are very large, $P$ will be small even though
the fit appears to be very good, a feature which may be due to
the inability of the test to distinguish between heterogeneity
and merely accidental variations, and to the fact that the basic
assumptions of the theory are revealed, when the data become
very large, as being not wholly appropriate (see V\32\2Q^ and
P:jf0:526, and p. 342; C; 25); (iv) Values of $P$ above .95 should
be viewed with suspicion; and (v) When $d$ exceeds 30, $P$ can be
found with sufficient accuracy (as suggested by R. A. Fisher,
P:4S; see also p. 176; A; 19) by computing $\sqrt{2\chi_0^2}$, and
thence (from tables of the “probability integral” with unit standard
deviation; see p. 176; A; 5) the area of the Normal Curve beyond
the ordinate which is $\sqrt{2\chi_0^2} - \sqrt{2d - 1}$ units from
the mean, the value of $P$ so determined then being interpreted as
above; or, more simply, by computing
$\sqrt{2\chi_0^2} - \sqrt{2d - 1}$, from which it may
be inferred, when this quantity is considerably greater than 2,
that the observed $\chi_0^2$ differs significantly from that expected.
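Suggestion (v) is easily put into a short computation. In the Python sketch below the area of the Normal Curve beyond the deviate is obtained from the complementary error function instead of from printed tables of the probability integral; that substitution, the function names, and the numerical values are assumptions of this illustration.

```python
import math

def fisher_normal_deviate(chi2_0, d):
    """sqrt(2*chi2_0) - sqrt(2*d - 1), treated as a unit normal deviate when d exceeds about 30."""
    return math.sqrt(2.0 * chi2_0) - math.sqrt(2.0 * d - 1.0)

def approximate_p(chi2_0, d):
    """Area of the unit Normal Curve beyond the deviate, taken as an approximation to P."""
    z = fisher_normal_deviate(chi2_0, d)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

chi2_0, d = 67.5, 50   # hypothetical observed chi-square and degrees of freedom
z = fisher_normal_deviate(chi2_0, d)
print(f"deviate = {z:.3f}, approximate P = {approximate_p(chi2_0, d):.4f}")
```

For these figures published tables of $\chi^2$ place $P$ close to .05, so the approximation can be compared against a tabulated value.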

From the basis of the $\chi^2$ test as a method of examining
goodness of fit, and the foregoing suggestions concerning its
practical application, it will be realized that the process often
becomes, in effect, an attempt to compress the results of a wide
variety of controlling influences and resultant deviations within the
compass of a single figure, $P$. The precise inferences to be drawn
therefrom are consequently often obscure; the discovery of a value
of $P$ which does not lie between about .1 and .9 may indicate the
need of caution; but while it will then prompt a search for causes
of disturbance, it generally can indicate little of the nature of
those causes. To that extent the method, at least in many
instances with which actuaries are concerned, suffers from the
disabilities of any “index number”. For this reason the calculation
of $\chi^2$ and its resulting $P$, as a test of goodness of fit, will
often (especially in work with large blocks of mortality data
where, as already pointed out, $P$ may be small even though the
fit is good) remain somewhat of a theoretical criterion, and in
the hands of a practical actuary will not replace an analysis of
the deviations group by group. In fact, the analysis by groups
will have to be performed whenever $P$ is small or large, and even
when $P$ lies between .1 and .9 an actuary would generally not be
content to forego such an enquiry.

Notwithstanding these limitations of the method as a test
of goodness of fit of mortality table graduations, it should be
emphasized again, however, that the $\chi^2$ function occupies a
fundamentally important place in the philosophical and mathematical
formulation of the theory of sampling, and should therefore be
clearly understood by every student.



X. RECENT RESEARCHES, AND MISCELLANEOUS 

PROBLEMS 


In this chapter we shall indicate very briefly — mainly, indeed, 
by a series of references only — a number of matters which, 
although of interest to actuaries and vital statisticians, are not 
of sufficient practical utility in connection with this study to 
require more extended treatment. 

(1) “Confidence” or “Fiducial” Limits

In the problem of statistical estimation an important question
concerns the limits, or “interval”, or “belt”, within which an
estimate of an unknown parameter may be expected to lie.

For the simplest case of the point binomial, for example, an
estimate of the parent probability $p$ may be taken as the observed
proportion $p'$ in $n$ trials (cf. p. 291; C; 10) where, however, the
confidence to be reposed in $p'$ as an estimate of $p$ will, on the
principles of Bernoulli's theorem, evidently increase as $n$
increases. With $p$ thus taken as $p'$, and when $n$ and $p'$ are not
too small, so that the assumptions of the Normal Curve are
permissible, the relations deduced at the end of Chapter III indicate
that $p$ will lie between $p' - 2\sigma\{p'\}$ and
$p' + 2\sigma\{p'\}$ in approximately 95% of cases, between
$p' - 3\lambda\{p'\}$ and $p' + 3\lambda\{p'\}$ (where $\lambda$
denotes the probable error) in about 96%, and between
$p' - 3\sigma\{p'\}$ and $p' + 3\sigma\{p'\}$ in over 99%. The values
95%, etc., are spoken of in modern terminology as confidence
coefficients. These conclusions are employed, for instance, in the
method noted at p. 174; A; 17 for setting up vertical bars to mark the
confidence interval in a graphic graduation. The limits for values of
$n$ up to 1000, and for confidence coefficients of .95 and .99, are
shown in a very convenient diagram form in the point binomial case by
Clopper and E. S. Pearson in FilSAlOAll.
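The limits just described can be illustrated numerically. The Python sketch below assumes the usual binomial standard deviation of an observed proportion, namely the square root of $p'(1-p')/n$, and the Normal Curve relation that the probable error is 0.6745 of the standard deviation; both are standard results brought in for this example, and the observation of 120 occurrences in 400 trials and the function name are hypothetical.

```python
import math

PROBABLE_ERROR_FACTOR = 0.6745   # probable error = 0.6745 * standard deviation (Normal Curve)

def binomial_limits(successes, n, multiple=2.0, use_probable_error=False):
    """Limits p' -/+ multiple * sigma{p'} (or * probable error), with sigma{p'} = sqrt(p'(1-p')/n)."""
    p_obs = successes / n
    sigma = math.sqrt(p_obs * (1.0 - p_obs) / n)
    unit = PROBABLE_ERROR_FACTOR * sigma if use_probable_error else sigma
    return p_obs - multiple * unit, p_obs + multiple * unit

# Hypothetical observation: 120 occurrences in 400 trials.
for label, mult, pe in [("2 sigma", 2, False), ("3 p.e.", 3, True), ("3 sigma", 3, False)]:
    low, high = binomial_limits(120, 400, mult, pe)
    print(f"{label}: {low:.4f} to {high:.4f}")
```

These are the Normal Curve limits only; for small $n$ or small $p$ the diagrams and tables referred to in the text would be used instead.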

Indications of the use of these principles are given at p. 343;
C; 26.

When $p$ (or $q$) is less than about .03 and $np$ is less than about
10, the use of the Poisson distribution is indicated (see p. 60), and
the confidence limits may consequently then be taken with advantage
from tables or diagrams based on Poisson's exponential (55).
This modification has been discussed in P:^^, and again clearly by
Ricker in P:108. An example from the latter paper is given at
p. 344; C; 26.

In the case of small samples (say $n < 30$) the problem of
confidence limits has also received increasing attention in recent
years through the development of “Student's distribution” (44);
see, for examples, P:i 1:8:89-91.

(2) The Theory of Estimation, and the Testing of Hypotheses 

The nomenclature, theoretical formulations, and practical 
inferences involved in modern statements of the theory of estim- 
ation and the testing of hypotheses have been examined and 
re-examined in a great number of recent papers. For the pur- 
poses of this study it is not necessary to attempt even any classi- 
fication of these contributions. The philosophical intricacies of 
the discussions have led to abstractions which have proved diffi- 
cult to grasp, and misunderstandings which have produced 
criticisms and sharp controversies. Some idea of these disagree- 
ments may be gathered from the long but extremely interesting 
interchanges in P:^, P:P1, and V\92y where Bowley, Isserlis, 
Jeffreys, R. A. Fisher, Neyman, E. S. Pearson, Greenwood, and 
others have explored the foundations of various logical ap- 
proaches. A connected and very comprehensive mathematical 
presentation is that of Wilks in P:15ff. 

(3) Orthogonal Polynomials 

In the fitting by least squares of the parabolic expression
$y = \alpha + \beta x + \gamma x^2 + \ldots$, for which the “normal
equations” have been discussed in Chapter VIII, it is sometimes
desirable to ascertain whether the use of an additional term will
effect an improvement in the fit. In this connection it is to be
remembered that the values of $\alpha$ and $\beta$ determined by least
squares for the straight line will not be the same as those for a
similar fitting of a second-degree curve which requires $\gamma$ as
well as $\alpha$ and $\beta$; the problem of passing to a curve of
higher degree therefore is not that of simply appending a term to
others which have been found before (see, for example, P:f 77:321-4).
By using the orthogonal polynomials of the famous Russian
mathematician Tchebycheff, however, it is possible to follow a
systematic procedure which utilizes all previous work when the
parabolic expression to be fitted is extended term by term.

Recent investigations by A. C. Aitken have contributed
materially to the development of this method. Actuarial students
may be referred conveniently to P:8J^ and P:S:115-120 for
the theory of these polynomials, and to the same sources and also
P:^5:92 for their practical applications.

(4) Regression and Correlation 

The elementary text-books providing clear descriptions and
worked examples of linear and non-linear, multiple, and partial
“regression” and of linear and non-linear, multiple, and partial
“correlation”, are now so numerous and accessible that we shall
here simply refer the student to several of the most recent of
such publications. For elementary discussions, P. R. Rider's
volume (P:112:21 et seq.), W. D. Baten's (P:5:145-200 and
223-232), and C. H. Goulden's (P:4^:52-87 and 219-246), are excellent;
more mathematical approaches are given by B. H. Camp
(129-179 and 286-347), Elderton (P:S^:141-180 and 210-230), and
H. L. Rietz (P:ii0:77-113); and a detailed treatment is provided
by Yule and Kendall (P:777:196-308).

The numerical illustrations of the fitting of linear and curved
(polynomial) regression lines by the method of least squares
(unweighted) which are shown in these texts will be of interest
also in connection with Chapter VIII (and cf. p. 324; C; 21).
R. A. Fisher's convenient summation method of fitting, so that
at any stage a further term may be added to a fitting already
made without disturbing the previous calculations, is described
and illustrated in P:45:148-176; P:45:234; P:Jf45:156; and PxlSl :
324.

It may be useful also to note here that the least squares fitting
of polynomials, which usually becomes laborious when the data
comprise a large number of terms and the polynomial is of degree
above the 3rd, has been facilitated by the publication by H. T.
Davis of tables of the required coefficients as far as 51 equidistant
terms for degrees up to the 7th. Another valuable contribution
describing a method which is even simpler by reason
of the small figures required in the solution (cf. P:28:117) is given
by Birge and Shea in Pill.

Certain applications of correlation theory to actuarial problems
are noted at p. 344; C; 27.

(5) The Analysis of Variance 

The name “analysis of variance” (cf. p. 163; A; 6) is a term
now used to indicate a process, arising from the Lexis theory
(pp. 30-33 here, and cf. P:J:54), based in fact upon the “correlation
ratio” introduced by Karl Pearson in 1905, and since
developed widely by R. A. Fisher, for dividing the total sum
of the squared deviations of a variate from its sample mean into
those distinct sums of squares, corresponding to supposed or real
causes of variation, which give estimates of the variance from
each such cause. The method is employed principally in connection
with designs in biological and agricultural experiments,
and is noted here because the student will encounter the name
frequently in the perusal of modern texts on the general application
of mathematical statistics. The subject has been covered
in the first instance by Fisher in P:4S:216-306; the mathematical
basis is described clearly in P:177:444-448 and P:S:54 and
136-142; and additional explanations and numerical examples are
given conveniently in et seq., PilSl:179 et seq., and
P:4S:114 et seq.



XI. AN OUTLINE OF A COURSE IN GRADUATION 


The main objective of the preceding chapters has been to 
assemble those portions of the theory and applications of Mathe- 
matical Statistics which are required by actuaries in their studies 
and their daily practice. Amongst the various applications 
which have been indicated, the theory and practice of “graduation”
(the fundamental concept of which is discussed in
Chapter VIII) represents one of the most important subjects,
as will be evident from the discussions of curve forms and fitting 
methods in Chapters VII, VIII, and IX. Other modes of grad- 
uation, however, are available, and are widely used; and all of 
them involve, in varying degrees, the principles of mathematical 
statistics considered in this volume. 

In the Preface and the Introduction it was pointed out that 
one of the chief difficulties encountered in the teaching of actuar- 
ial mathematics has always arisen from the hiatus which has 
existed between the elementary studies of “probability” and the
student’s subsequent encounters with the advanced methods 
necessary for a proper understanding of graduation processes. 
The aim of the preceding chapters has been to bridge that gap, 
so that the student may be enabled to understand the mathe- 
matical concepts of the various graduation methods with greater 
ease. Co-ordination of the reading from many sources will still 
be essential, however — for although the underlying theories of 
mathematical statistics are to be found here, the expositions of 
several important graduation devices must yet be sought else- 
where. 

This final chapter will consequently indicate a course of read- 
ing, bringing the appropriate sections of this volume into relation 
with the available discussions of particular graduation methods. 
An outline only will be given, with the essential references, but 
without detailed explanations. By this means it is hoped that 
the student will be able to plan his reading systematically, and 
with so satisfactory an understanding of the necessary fundamentals
that he will then experience no difficulty in grasping the
theories and the practices of graduation as he proceeds.

(I) The Nature and Objects of Graduation 

1. The basic concept of the graduation of a statistical series
is explained at the beginning of Chapter VIII as, in effect, the
determination of the "true" hypothetical "parent population",
$f_1, f_2, \ldots, f_n$, of which the "observed" $f'_1, f'_2, \ldots, f'_n$
can be considered to be a random sample. In practice, however, it is
usually possible to find only the "fitted" or "graduated" values,
$f''_1, f''_2, \ldots, f''_n$, as "estimates" of the true values
$f_1, f_2, \ldots, f_n$ by some process which will satisfy some
predetermined criterion of a "satisfactory" or "best" graduation.

2. When the graduation is accomplished by fitting a mathe- 
matical formula (by any of the methods discussed in Chapter 
VIII) it will be clear that the main criterion must be a test of 
goodness of fit, since the values derived from the graduation 
process will lie on a mathematical curve, and therefore will be 
inherently "smooth". If, however, some other method of grad- 
uation is employed which does not necessarily place the grad- 
uated values upon an inherently smooth curve, it will evidently 
be necessary to test the results for smoothness as well as for good- 
ness of fit. In this connection it should be noted also, at this 
stage, that in such a graduation of irregular data it will obviously 
not be practicable to secure a best possible fit and greatest pos- 
sible smoothness at the same time — for the ultimate interpreta- 
tion of a "best possible fit" would require the precise reproduction 
of the original data, without any smoothness having been at- 
tained. It is therefore necessary in such cases to settle the 
criteria for fit and smoothness so that the practical results may 
be satisfactory, rather than best, in both respects. 

(II) The Tests of Fit 

1. After a graduation has been performed, the admissibility,
i.e., the "satisfactory" character, of the results secured by the
graduation process may be examined by the following tests:

(i) The frequency of changes of sign; see Chapter IX,
section (i).

(ii) The standard deviations (or probable errors); see
Chapter IX, section (ii).

(iii) Comparisons of financial functions; see Chapter IX,
section (iii).

2. The goodness of fit may be tested by an examination of the
extent to which the results satisfy the “least squares” criterion
(see Chapter IX, section (iv)).

3. The hypothesis that the observed values may be supposed
to have been drawn, by chance alone, as a random sample from
the graduated series (so that the graduated values may be acceptable
as a representation of the “true” values of the parent population)
may be investigated by the $\chi^2$ test (see Chapter IX,
section (v)).
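
As a rough illustration of two of the tests listed above, the following Python sketch counts the changes of sign in the deviations between the actual and the graduated (expected) values, and computes a statistic of the usual Pearson form, the sum of (actual - expected)^2 / expected. The function names and the five-age figures are hypothetical, and the Pearson form is an assumption made for this sketch; the precise procedures are those of Chapter IX.

```python
def sign_changes(actual, expected):
    """Count changes of sign in the deviations actual - expected.

    Zero deviations are skipped (an assumption of this sketch).
    """
    signs = [1 if a > e else -1 for a, e in zip(actual, expected) if a != e]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def chi_square(actual, expected):
    """Pearson-type statistic: sum of (actual - expected)^2 / expected."""
    return sum((a - e) ** 2 / e for a, e in zip(actual, expected))


# Hypothetical actual and graduated (expected) deaths at five ages.
actual_deaths = [12, 18, 25, 31, 44]
expected_deaths = [10.0, 20.0, 25.0, 30.0, 45.0]

changes = sign_changes(actual_deaths, expected_deaths)
statistic = chi_square(actual_deaths, expected_deaths)
```

A graduation which follows the data closely would be expected to show frequent changes of sign and a moderate value of the statistic; the interpretation of particular values is left to the discussions cited.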

(III) The Tests of Smoothness 

1. As already stated in (I), an enquiry into the smoothness 
of the graduated values is necessary only when the graduation 
has been performed by some method other than the fitting of 
an inherently smooth mathematical formula. In the graduation 
of most actuarial data it is sufficient (and customary) to assume 
that differences beyond the 3rd in the true values may be neg- 
lected (although in some cases this assumption should be exam- 
ined carefully — P:166:8lf 107, and 110). On this basis the small- 
ness of the 3rd differences of the graduated values would consti- 
tute a practical indication that the differences of higher orders 
would be almost negligible, and the sum of the squares of the 3rd 
differences would afford a comparative measure of the smooth- 
ness of the results (cf. P:59:8). 
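
The comparative measure of smoothness just described can be sketched in a few lines of Python. The function names are hypothetical, and the choice of the 3rd order follows the assumption stated above that differences beyond the 3rd may be neglected.

```python
def differences(values, order):
    """Return the finite differences of the given order of a series."""
    out = list(values)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def smoothness(values, order=3):
    """Sum of the squares of the differences of the given order.

    Smaller results indicate a smoother series.
    """
    return sum(d * d for d in differences(values, order))


# A quadratic series has zero 3rd differences, so the measure is 0.
quadratic = [x * x for x in range(6)]
measure = smoothness(quadratic)
```

The measure is comparative only: it is useful for ranking alternative graduations of the same data, not as an absolute standard.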

(IV) Graduation by the Fitting of a Mathematical Formula 

Because a mathematical formula is inherently smooth, it is 
natural to place the single problem of fitting such a formula as 
the first method of graduation to be listed here — particularly 
since the principles have been largely discussed in the earlier
chapters of this volume. 

1. In actuarial work the selection of the form of curve which 
is likely to be appropriate may be made from a consideration of 
the underlying hypotheses and types set out in Chapter VII. 

2. The methods of fitting the selected form are described in 
Chapter VIII. 

3. Special devices for the graduation by Makeham’s formula 
(83) of the rates of mortality during the select period are noted 
in section (IX) hereafter. 

4. Particularly in the compilation of annuitants’ experiences 
where the annuity values are a most important function, or some- 
times if it seems advisable in any experience to merge the select 
into the ultimate values at a duration earlier than that strictly 
indicated by the data, it may be desirable to adopt a special 
method of graduation which will reproduce the annuity values 
(rather than the basic rates such as qx) as closely as possible. 
G. F. Hardy devised such a process, on the assumption of 
Makeham’s formula (83), for the British Offices’ Annuitants’ 
Experience. [The employment of the method in that experience, 
however, was occasioned not so much by the use of an arbitrarily 
short period of selection as by the basing of the ultimate table 
upon “aggregate” data, from which duplicates had been eliminated
in a manner different from that adopted for the “select”
data. It has been pointed out in H:f(?ff:289 and H:98:361 that
if the ultimate table had been founded on “select” data (as was 
done with the table where the method here under discussion 
was not used), this special method would not have been required 
even though the select period would still (on the evidence of the 
material) have been arbitrarily short.] His own brief description 
in H:5d:127-131, and that given in P:5S:97-99, may be eluci- 
dated by the detailed explanations in P:77, while the proofs of 
the formulae required are amplified in P:f^^:548-551. [The 
references in this last discussion to pars. 88-91 of Henderson’s 
Actuarial Study No. 4 (P:5P) are to the first edition, the corresponding
paragraphs of the second edition being numbered
7.91. The symbol (I A) a; in the top line of p. 551 should be 
(IA)x]. 

(V) Graduation by the Graphic Method 

Instead of selecting and then fitting a mathematical formula 
as in section (IV), it is evident that a simpler alternative in many 
cases would be the drawing of a smooth freehand curve amongst 
the points which represent the data. 

1. The elementary principles to be employed in a direct graphic
graduation are discussed in P:5^:11-16 and P:f^7:98. The
limits within which the graduated curve should lie may be indicated
as at p. 174; A; 17. The use of splines is described in P:152.

The graphic method obviously can be applied in a manner 
to secure whatever compromise between fit and smoothness the 
graduator may desire. While this flexibility may be a disadvan- 
tage (in that there is no a priori criterion of the degree of com- 
promise to be secured), it is nevertheless a useful characteristic 
under some conditions — such as those explained in par. 10 of 
section (VI) here. 

2. As it is sometimes difficult to apply the graphic method 
directly in the graduation of mortality and similar material, a 
useful device is to adopt an approximate graduation by a mathe- 
matical formula, or the values from some standard table, as a 
basis, and hence to graduate graphically only the ratios or differences
between the data and the selected base. An illustration of
this device is given in P:^P:17-19. The student should refer
also to P:81, and to Chapter VII, section (5), p. 85 here.

3. A graphic method of determining the constants in Makeham's
formula (83) in the form $\operatorname{colog} p_x = \alpha + \beta c^x$,
by the use of “semi-logarithmic” paper, is noted at p. 103 of
Chapter VIII.

4. Methods of handling the graphic graduation of select rates
of mortality are set out in section (IX) hereafter.

(VI) Graduation by Finite Difference Interpolations

The assumption that differences in the true values, $f_x$, may be
neglected beyond a certain order, $j$, means that
$f_x = A + Bx + Cx^2 + \cdots + Jx^j$. In most actuarial data it is
legitimate to take $j = 3$ (although, as already noted in section (III),
this supposition should be examined carefully in some cases).

1. The first step in approaching the graduation of observed
values by finite difference methods obviously would be to choose
the value of $j$, to select single points in the data equal in number
to the constants $A, B, \ldots, J$ in the above polynomial, to solve
the simultaneous equations for $A, B, \ldots, J$, and thence to compute
the interpolated values (see P:166: par. 6). The selected
points, however, remain unadjusted in this process.

2. Instead of using single values, therefore, groups of values
might be selected, so that consecutive sums rather than single
points would be retained; for example, by using $\sum \log p_x$ in
decennial sums, i.e., $\log({}_{10}p_x)$, the original decennial
probabilities of living would be undisturbed, but the yearly values
$p_x$ would be redistributed (P:166: pars. 7 and 12). The formulae of
this method may be reached easily as in P:166: pars. 8-11. This
process may be viewed as one of “fitting” in which the criterion
of “fit” is the identity of the graduated and ungraduated consecutive
sums.

3. When ordinary interpolations are made from selected 
points, or separate abutting sums, however, breaks of continuity 
occur at the points of division. An early empirical method widely 
used for the smoothing of these breaks was a double interpolation 
based on two overlapping series and then blended by the factors 
of the Curve of Sines (P:167:101). Other overlapping methods of
ordinary interpolation also may give good results (P:166:102,
footnote).

4. The problem of securing smoothness in the interpolations 
at the points occupied by the original data is dealt with more 
satisfactorily, however (P:167: par. 99), by employing the method 
of osculatory interpolation. The principles embodied in the various
formulae embraced by that designation may be understood
easily by tracing their development in the following order, which
is mainly chronological.

(i) Sprague's original 5th difference osculatory formula is
given, with references for its assumptions and demonstrations,
in P:167: par. 100.

(ii) The corresponding Karup-King 3rd difference osculatory
formula is stated similarly in P:167: par. 101.

(iii) The methods of derivation on which (i) and (ii) are
based have been generalized by Reilly for differences of odd order
$2h+1$ and contact of order $k$ with the partial curves, Sprague's
form of proof being covered in and Lidstone's in

Thus with $h = 2$ and $k = 2$ Sprague’s formula (i) is given; the
expressions for 5th differences and 3rd order contact ($h = 2$, $k = 3$),
7th differences and 2nd order contact ($h = 3$, $k = 2$), and 7th
differences and 3rd order contact ($h = 3$, $k = 3$) are also shown. For
practical purposes, however, these extensions are not usually
required.

(iv) Another 5th difference formula founded on Sprague's
basic assumptions, but with the variation that the differential
coefficients are determined from the mean of their values in the
partial curves, was evolved by Buchanan (H:n4:372-4; see also
P:7i:1^8 and P:ff^:90). Sprague’s formula (i) is preferable,
however, because of its more convenient numerical coefficients
(P:^:124).

(v) In Sprague's assumptions the partial curves for determining
the differential coefficients are taken as of one degree less
than that of the osculatory curve. Henderson pointed out
(H:103:215-7) that this restriction is not necessary, and accordingly
produced an improved formula given, with references, in P:167:
par. 103.

(vi) The use of partial curves in the preceding applications of
Sprague's assumptions is, in fact, somewhat arbitrary. Henderson
accordingly suggested (H:f 05:219, and P:54:190) that it is
preferable to discard the partial curves, and instead simply to
impose the condition that the differential coefficients at the points
of junction should be continuous. The 5th difference formula
which he thus reached is given, with references, in P:167: par. 104.

(vii) The preceding formula (vi), however, as observed by
Henderson (H:103:221 and P:54:186-7 and 191), is not truly
osculatory, for there is a discontinuity in the first differential
coefficient of A 5® resulting from the fact that the formula is
based on only an approximate solution of the difference equation
involved. He therefore gave (H:f 4^:24) an exact method as
stated in P:167: par. 105 (see also P:5S:119 and 121-4 for a further
mathematical discussion).

(viii) The development of a whole set of exact osculatory
formulae has further been shown clearly by Jenkins in P:ffP, on
the same assumptions as Henderson's in (vi), namely, (a) that
the two curves must take the given value at the common point,
and (b) that the corresponding derivatives of each curve at the
common point must be equal to each other (though not necessarily
equal to any predetermined value as in Sprague's assumptions).
The Karup-King formula (ii) forms one of the set; the
other formulae reached were new. For the 5th difference case,
assumption (a) is met by taking $\phi(x) = x(x-1)\psi(x)$, and
$\psi(x) = a_0 + a_1 x + a_2 x^2$, which gives Jenkins' formula;
$\psi(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ (which is redundant to the
extent of one degree in $x$ since the term $a_3 x^3$ is not necessary)
produces Sprague's formula
(i) when a 2 = — and Buchanan's (iv) when a 2 = ~ i; and 
$\psi(x) = a_0 + a_1 x$ gives Henderson's formula (vi).

In a subsequent paper (P:70) Jenkins also developed another 
set of truly osculatory formulae of minimum degrees in all cases. 

In P:72:24 and 30 he further produced the formulae based on 
even instead of odd differences. 

(ix) All the preceding formulae deal with the usual problem 
of osculatory interpolation from data at equal intervals. The 
case of unequal intervals on Sprague's assumptions is considered 
by Ackland (H:124) and Reilly (H:169).
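
As an illustration of the 3rd difference osculatory formula of (ii), the following Python sketch uses the Karup-King formula in the Everett form in which it is commonly stated,
$v_{x+s} = s\,u_{x+1} + \tfrac{1}{2}s^2(s-1)\,\delta^2 u_{x+1} + t\,u_x + \tfrac{1}{2}t^2(t-1)\,\delta^2 u_x$, with $t = 1 - s$. It is assumed here, not confirmed from the sources cited, that this is the form given in P:167: par. 101; the function name and the test values are hypothetical.

```python
def karup_king(u_prev, u0, u1, u_next, s):
    """Karup-King interpolation at x + s, for 0 <= s <= 1.

    u_prev, u0, u1, u_next are the values at x - 1, x, x + 1, x + 2
    (unit spacing is assumed in this sketch).
    """
    t = 1.0 - s
    d2_0 = u_prev - 2.0 * u0 + u1    # central 2nd difference at x
    d2_1 = u0 - 2.0 * u1 + u_next    # central 2nd difference at x + 1
    return (s * u1 + 0.5 * s * s * (s - 1.0) * d2_1
            + t * u0 + 0.5 * t * t * (t - 1.0) * d2_0)


# Values of 3x^2 - 2x + 1 at x = -1, 0, 1, 2; the formula reproduces
# a second-degree curve exactly, and the given values at s = 0 and s = 1.
value = karup_king(6.0, 1.0, 2.0, 9.0, 0.3)
```

When the given points are quinquennial, the same function may be applied with s = 1/5, 2/5, 3/5, 4/5 to supply the intermediate yearly values.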

5. The principle of osculatory interpolation was first generally 
applied in the preparation of mortality tables from the death 
registers and census returns of population statistics, for which in 
earlier years the data were given in age-groups only — tabulations 
by single years of age not being available (P:f^:100 and 102). 
It was realized, moreover, even when the data can be secured by
single years, that the errors in the statements of age which are char- 
acteristic of such population material (P:i 57* 26-57, and P: 172) 
may often be dealt with satisfactorily by grouping in suitable age 
classes, on the principle that the totals in quinquennial or decen- 
nial age-groups may be assumed to be correct although the values 
within each such group may require redistribution. The problem 
of interpolation therefore was to effect a smooth redistribution 
of the grouped data into the values at each age, so that the irre- 
gularities at individual ages would be removed without unduly 
disturbing the totals in each group. 

This reproduction of the grouped data being fundamental, 
Shovelton (H:123) therefore investigated the effect of introducing
that requirement directly into the determination of the osculatory
curve, and found a formula given in P:167: par. 102. As
pointed out in this method imposes one condition less
than in a truly osculatory formula.

6. All the formulae derived by Sprague’s assumptions which 
use partial curves, or from Henderson’s which discard the partial 
curves and simply require that the differential coefficients shall 
be continuous, have in common the other basic condition that 
(being formulae for interpolation) the two curves must take the 
given value at the common point. When such osculatory for- 
mulae have been used, as in many of the population tables, to fill 
in the values between certain predetermined points, it has been 
found — unless the values at those points themselves lie upon a 
smooth curve — that the whole curve which finally results will 
show many undulations and points of inflexion, even though it 
will be free from discontinuities (P:^^:124). In order to meet this 
weakness Jenkins (P:71) therefore released the two curves from the
requirement that they must take the given value at the common
point, and instead permitted, in effect, that the interpolated
value shall differ from the given value by a fraction of its 2nd
difference in the 3rd difference formula, or of its 4th difference
in the 5th difference formula (see P:71:201 and 202). An excellent
demonstration of the precise assumptions involved is given
by Lindsay in P:86:211, where it is shown that, in the 5th difference
formula, “the ordinates and second derivatives must pass
through points differing from the predetermined points by the
values of certain functions, while the first derivatives must exactly
assume the predetermined values” (cf. also P:106:189 and 220,
and P:J4:160-1). Since the ordinates at the predetermined points
are thus not to be reproduced, the resulting modified formulae
will evidently effect some adjustment of those values in addition
to performing an osculatory interpolation for the intermediate
values. Jenkins gave a general expression, and the 3rd and the
5th difference formulae, in P:71:199-206; and the general form
for even orders of differences, and the 4th difference formula, in
P:72:10-12 and 24-30 [see also P:14AS8 for complete proof of the
5th difference formula by Jenkins' method (noting the comment
at P:106:189), and P:86:211 for Lindsay’s elegant alternative
demonstration]. In practice it is important to choose the formula
which extends into an order of differences alternating in
sign, for otherwise the graduated series will lie everywhere above
or below the observed points; in mortality experiences, therefore,
where the 2nd differences are usually not negligible while the 4th
differences change in sign frequently, it is found that the 4th or
5th difference “modified” formulae are suitable but the 3rd difference
one is unsatisfactory (P:71:202 and P:7;^).

Since all the preceding osculatory formulae interpolate in the 
middle major interval, special treatment is necessary for the 
initial and final intervals. It may therefore be noted here that 
Jenkins gives the formulae for the two intervals at each end with 
his 5th difference “modified” formula of par. 6 preceding (P:71:209;
see also P:74:135 and 140). It has also been pointed out
by Lidstone (P:8^) that precisely the same results can be obtained
by inserting the value 0 in the places of the missing values
of $\delta^4$ preceding the first and succeeding the last known $\delta^4$,
completing the difference table by addition to get the artificial values
of the missing differences, and then applying the usual osculatory 
formula throughout. 

7. Observing that Jenkins’ “modified” 5th difference formula
is not completely determinate, but only one of many satisfying
the necessary conditions, Reid and Dow (P:106) have remarked
that one arbitrary constant is avoided by the implicit condition
that no differences beyond the 5th are to appear in the final
formula. By relaxing this condition, therefore, they obtain a
general 5th degree “modified” formula, which completely satisfies
the osculatory conditions, with 3 instead of only 2 arbitrary
constants, Jenkins' expression being the special case of this
general formula when the new constant, b, is zero. In practical
application, Reid and Dow accordingly suggest that the values
should first be calculated by Jenkins' formula, and that the flexibility
afforded by the b term should be used to improve the
results as may seem desirable. The procedures which they
actually adopted are noted in par. 10 hereafter, since they are of
a type belonging to the methods there stated.

8. All the foregoing interpolation processes are designed to
secure smoothness of junction in the final results at the predetermined
points of division of the observed data. In some of
the practical applications the data at these points have been left
unchanged, only the intermediate values being interpolated by
one of the osculatory formulae (P:167: par. 99). In order to graduate
the values at the points of division, therefore, George King
proposed a simple method in which adjusted quinquennial
pivotal values are first calculated centrally by ordinary 3rd difference
interpolation from three quinquennial sums of the data,
whence the intermediate values are supplied by 3rd difference
osculatory interpolation (P:167: pars. 107 and 110). King applied
his method separately to the deaths and populations; in more
recent examples (P:^70:334 and P:74:129 and 149) a single application
to $q_x$ has given equally good results. In H:155 it was found
convenient to compute the pivotal values at the first instead of
the central ages of each quinquennial group, by the formula
resulting from putting w = 1 and x = -2 in (7) of P:/^^:87.

If it is thought desirable to determine the pivotal values from 
more than three groups, the corresponding formulae based on 
four groups and j =3, or on five groups and j = 4 or 5, may be used 
as given in P:167: par. 108. Fifth difference osculatory interpolations
have also been used widely for the subsequent intermediate
values.
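
King's central pivotal value from three quinquennial sums, described in par. 8, is commonly quoted as $u_0 = 0.2\,w_0 - 0.008\,\Delta^2 w_{-1}$, where $w_{-1}$, $w_0$, $w_1$ are three consecutive quinquennial sums and $u_0$ is the value at the central age of the middle group. The following Python sketch assumes that form (it is not confirmed here against P:167: par. 107); the function name and the trial series are hypothetical.

```python
def king_pivotal(w_prev, w_mid, w_next):
    """Central pivotal value from three consecutive quinquennial sums.

    Uses the commonly quoted form 0.2 * w_mid - 0.008 * (second
    difference of the sums), which is exact when the underlying
    values follow a curve of the 3rd degree or lower.
    """
    second_difference = w_prev - 2.0 * w_mid + w_next
    return 0.2 * w_mid - 0.008 * second_difference


def trial(x):
    """Hypothetical second-degree series used to check the formula."""
    return x * x + 3 * x + 7


# Quinquennial sums centred at x = -5, 0, 5.
sums = [sum(trial(x) for x in range(c - 2, c + 3)) for c in (-5, 0, 5)]
pivotal = king_pivotal(*sums)   # should agree with trial(0)
```

With irregular data the same arithmetic yields an adjusted pivotal value, since each quinquennial sum averages out part of the fluctuation at individual ages.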

9. The pivotal values in the method of the preceding paragraph
are all found by ordinary interpolation from groups. Since
the real objective in the calculation of those values, however, is to 
obtain reliable points which represent the original data ade- 
quately but yet remove any undue fluctuations, and also because 
the subsequent osculatory interpolations passing through such 
predetermined points have a tendency to show undulations and 
points of inflexion (cf. par. 6 here), it has been suggested that the 
pivotal values might be determined from the quinquennial sums 
by using the formulae of osculatory instead of ordinary interpola- 
tion — the intermediate values thereafter being supplied, of course, 
by a corresponding osculatory process. This proposal was made 
first by Buchanan; a subsequent investigation by Jenkins, how- 
ever, re-examined the idea with certain earlier stages of its 
development which will therefore be given here in their natural 
order: 

(i) The pivotal value formula based on Sprague’s original 
5th difference osculatory expression (see (i) of par. 4 here) is 
given in P:7;^:14. 

(ii) The pivotal formula based on Jenkins’ modified 4th 
difference method (par. 6) is demonstrated in P:7^:13. 

(iii) The pivotal formula based on Jenkins’ modified 5th
difference process is derived in P:14:128-9.

The practical effects of these variations in procedure have
been tested in P:14 and P:7;^. Since (as Buchanan observed) the
pivotal values (which he called “guiding” values) would tend to
lie on a smoother curve when they are computed by the “modified”
formulae, it was found (as would be expected) that, with
regard to smoothness, Sprague’s osculatory interpolation as a 
basis for both the pivotal and subsequent calculations was im- 
proved slightly by the modified 4th difference method, which in 
turn was improved by the modified 5th difference process, while 
of course the order was reversed in respect of fit. 

For the determination of the first two and last two pivotal 
values with these methods, Lidstone has suggested (P:S;8:278) 
the method of inserting zero values for the missing differences as 
in par. 6 of this section. Buchanan, however, objected that this
has the effect of assigning definite magnitudes to the pivotal
values sought, and that unless they are reasonable the curve may
be distorted; he therefore prefers the insertion of reasonable 
values for the missing terms, calculation of the resulting differ- 
ences, and then interpolation so that the curve would not pass 
through the pivotal values (P:^5:209). 

10. At the beginning of the preceding paragraph it was indi- 
cated that the objectives to be borne in mind in determining the 
pivotal values should be the adequate representation of the orig- 
inal data and, in the final analysis, the derivation from them of 
satisfactorily smooth results. The pivotal values, consequently, 
should fit the data in accordance with some adequate criterion, 
which, however, must still permit the eventual construction of a 
smooth curve. The necessary compromise between fit and 
smoothness thus inherent in the process may therefore be facili- 
tated by determining the pivotal values by a method which will 
specifically recognize the requirement of fit as well as of smooth- 
ness — the subsequent interpolations then being made by one of 
the osculatory formulae already considered. 

(i) This evidently might be done by using a graphic method 
(P:71:206) to find the pivotal values, since that method can
easily produce any compromise between fit and smoothness that 
may be desired (see par. 1 of section (V) here). 

(ii) The criterion of fit adopted could be that of least squares,
by which the “best” fitting pivotal values would be computed
from, say, the quinquennial groups in order to effect the greatest
possible reduction of the mean square error (P:i4^:368). The
formula for five symmetrical groups is deduced as (51) in P:167:112-3
(and in P:i65:105, where $W_n = \sum_{x=n-2}^{n+2} W_x$). The principles
underlying this method are summarized at p. 282; C; 7, section
(xii) (cf. also par. 2 (i), section (VII) here).

(iii) The flexible b term of the general modified formula of
Reid and Dow (par. 7) may be brought into the calculations.
In P:106 they illustrate the following alternatives: (a) b taken
numerically to make the formula resemble Everett's formula
closely; (b) the pivotal values found as in (a), but in the interpolations
b taken to secure maximum smoothness by making
$\sum(\Delta^3 q_x)^2$ a minimum; (c) a double application of method (b); and
(d) b determined for the pivotal values to make the expected
equal to the actual deaths, and in the interpolations to yield
maximum smoothness therefrom. As would be expected, the
results of (d) were the most satisfactory.

(VII) Graduation by Linear Compounding (including 
“Summation” Formulae) 

1. The basic principle of this method has already been stated
at p. 282; C; 7, section (xii). The manner in which it was first
developed, from the interpolation methods of pars. 1, 2, 4(i), and
4(ii) of the preceding section (VI) here, is set out fully in P:166:
pars. 1-20, and the detailed references there stated. The linear
compounding formulae so evolved were based on the preliminary
selection of certain interpolations, and were therefore somewhat
fortuitously dependent on the particular selection made. The
very important pioneer work of De Forest (P:166) should be noted
here particularly, since his investigations, beginning in 1871, on
the theory of reduction in the mean square error, by which the
capacities of the various formulae could be compared, a priori,
in respect of fit and smoothness, constitute the earliest and most
complete treatment of the subject, a fact which still seems to be
not fully recognized.

2. Having thus established the method of comparing the a
priori fitting and smoothing abilities of linear compounding formulae,
De Forest then, between 1871 and 1880, gave a remarkably
clear and exhaustive examination of the whole subject,
which unfortunately remained unrecognized until the appearance
of in 1924. Equipped with this measure of the reduction
in the mean square error of the graduated term itself, or its differences,
De Forest published the following important series of
formulae:

(i) The symmetrical formulae, up to 25 terms for j = 2 or 3,
which give the greatest possible reduction in the mean square
error of the graduated term itself. He also indicated the
corresponding unsymmetrical formulae. [De Forest noted that
Schiaparelli had made an independent investigation in 1867; and
Woolhouse had considered the problem with a slightly different
objective in 1865. The formulae, with the additional cases of
j = 4 or 5, 6 or 7, and 8 or 9, were restated many years later
by Sheppard and Sherriff, without knowledge of De Forest’s
work.] See P:166: pars. 1 and 22-26. These formulae give the
best possible fit, according to the criterion of least squares.

(ii) The most important formulae which he published were
the symmetrical series, up to 25 terms when j = 3, which give the
greatest possible reduction in the mean square error of the 4th
differences (P:166: pars. 27 and 30). De Forest also gave the
7-term formula for the maximum reduction in the mean square
error of the 6th differences when j = 5, for use when a series varies
so rapidly that j > 3 (loc. cit., 107, footnote). These formulae,
which first appeared in 1873, antedated in their fundamental
conception a large number of very similar investigations (dealing
with the 2nd or 3rd differences instead of the 4th) by much later
writers (see par. 3 below). They are designed to give the greatest
possible smoothness, according to the criterion of the reduction
to be anticipated in the mean square error of the 4th differences.

(iii) Recognizing that it is not possible to secure the best fit
and maximum smoothness at the same time (cf. par. 2, section 
(I) here), De Forest gave the formulae up to 15 terms which 
emerged from an examination of the curve to be anticipated for 
the flow of the linear compounding coefficients. These formulae 
have good fitting as well as smoothing power. Again here his 
conclusions preceded by many years the work of subsequent 
investigators (P:166: par. 32). 

(iv) Another series with good values in respect of both fit and 
smoothness which De Forest gave is shown in P:166: par. 33.

(v) He also made an extensive investigation of the effects of 
applying some of his formulae repeatedly, and in this instance 
also reached some important conclusions on a matter which was 
suggested, but not closely examined, by others in later years 
(P:166: par. 34, and footnote). De Forest noted clearly the
manner in which, when any linear compounding formula with 
j = 2 or 3 is repeated a large number of times, the curve of the 
coefficients ultimately tends to a central bell-shaped portion with 
an infinite number of small undulations at each end. 
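
The least-squares “fitting” formulae of (i) can be reproduced numerically. The following is only a minimal sketch, not De Forest's own procedure: it assumes that j denotes the degree of the polynomial which the formula reproduces exactly, that NumPy is available, and the name `fitting_coefficients` is hypothetical.

```python
import numpy as np


def fitting_coefficients(terms, degree=3):
    """Weights giving the central value of a least-squares polynomial fit.

    `terms` is the (odd) number of equally spaced ungraduated values used,
    and `degree` plays the role assumed here for j.
    """
    if terms % 2 == 0 or terms <= degree:
        raise ValueError("terms must be odd and exceed the polynomial degree")
    m = terms // 2
    x = np.arange(-m, m + 1)
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    design = np.vander(x, degree + 1, increasing=True)
    # The fitted value at x = 0 is the constant coefficient of the fitted
    # polynomial, i.e. the first row of the pseudo-inverse applied to the data.
    return np.linalg.pinv(design)[0]


c5 = fitting_coefficients(5)
# With independent errors of equal variance (an assumption of this sketch),
# the mean square error of the graduated term is reduced in the ratio sum(c**2).
print(c5 * 35, np.sum(c5 ** 2))
```

For five terms the weights are proportional to -3, 12, 17, 12, -3 (divisor 35), and the same weights result for degree 2, which agrees with the pairing “j = 2 or 3” above.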





3. As stated in (ii) of the preceding paragraph, De Forest
determined his formulae for maximum smoothness when j = 3 by
minimizing the mean square error in the 4th differences. Many
years later the precisely analogous problems with respect to the
3rd or 2nd differences (when j = 3) were investigated in several
independent enquiries, all again without knowledge of De
Forest’s contributions. The formulae for the 2nd differences
were considered by Hardy and Sheppard; those for the 3rd
differences by Henderson and Larus. The details are given in P:166:
pars. 28 and 29.

4. All the formulae which are derived by the methods of pars.
1-3 emerge in the linear compound form given at (a) and (i),
p. 283; C; 7, section (xii). Numerical calculation by such
expressions offers no difficulty with modern calculating machines, a
device explained in P:7^:20-21 being useful in the work.

Before the widespread use of calculating machines, however,
the computations were greatly facilitated by putting the
formulae into a special “summation” form, i.e., in sums [p][q][r] . . .
where $[p]u_x$ denotes the sum of p u's of which $u_x$ is the middle
term; and that method is of course still very convenient if the
formula can be so expressed. The early literature consequently
gives many examples of this form. In some instances the linear
compounds were changed into the summation type by trial
(H:105:643); in others the governing summation could often be
selected from inspection of the central coefficient in the linear
compound, whence the remainder of the formula followed easily
(H:ni: vol. XLII, 133).
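
The passage from a summation form back to its linear compound is a repeated convolution, as the following minimal sketch illustrates. The function name `linear_compound`, the choice of [3][3] with no operand, and the data are all hypothetical; dividing by the total of the weights is an assumption made here so that a constant series is left unchanged.

```python
import numpy as np


def linear_compound(summations, operand=(1.0,)):
    """Expand the summation form [p][q][r]...{operand} into linear compound weights."""
    weights = np.asarray(operand, dtype=float)
    for p in summations:
        # Applying [p] replaces each term by the sum of p consecutive terms,
        # which is a convolution with p unit weights.
        weights = np.convolve(weights, np.ones(p))
    # Divide by the total so that a constant series is reproduced.
    return weights / weights.sum()


w = linear_compound([3, 3])
u = np.array([4.0, 7.0, 5.0, 9.0, 8.0, 6.0, 10.0])  # illustrative data only
# Each graduated value is the weighted sum of the 5 terms centred on it.
graduated = np.convolve(u, w[::-1], mode="valid")
print(w * 9, graduated)
```

Here [3][3] expands to the weights 1, 2, 3, 2, 1 with divisor 9, so only the middle three of the seven illustrative terms can be reached by the symmetrical formula.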

Since, however, it was sometimes difficult to find these
summations, a great deal of attention was paid to the direct
production of summation formulae correct to 3rd or 6th differences,
without first deriving the corresponding linear compound, and
without any attempt to secure any maximum or other specified
degree of reduction of mean square error in either the term or
any order of its differences. References to the long list of
researches are given in P:166: par. 21; the résumé in P:S7;260-267,
and the table at P:65:53, will now generally be sufficient for the
student’s purposes. It should be realized that the foundation of
these methods is the simple formula for $[n]u_x$ as first employed
by Hardy to 3rd differences in P:50:371 (noting the misprint
of $\frac{24}{n^2-1}$ for $\frac{n^2-1}{24}$ in formula (1) on p. 371 thereof), and deduced
also to 6th differences in H:105:534-7 and Viilll : vol. XLII, 137-8.
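
The coefficient noted in the misprint above can be checked directly. The sketch below assumes that the formula meant is the usual expansion $[n]u_x = n\left(u_x + \frac{n^2-1}{24}\,\delta^2 u_x\right)$, with $\delta^2$ the second central difference, which is exact when $u_x$ is a cubic; the function names and the cubic itself are arbitrary illustrations.

```python
from fractions import Fraction


def summation(u, x, n):
    """[n]u_x: the sum of n consecutive values of u with u_x in the middle (n odd)."""
    m = n // 2
    return sum(u(x + i) for i in range(-m, m + 1))


def summation_to_third_differences(u, x, n):
    """n * (u_x + (n^2 - 1)/24 * second central difference of u_x)."""
    second_central = u(x + 1) - 2 * u(x) + u(x - 1)
    return n * (u(x) + Fraction(n * n - 1, 24) * second_central)


def cubic(x):
    return 2 * x**3 - x**2 + 5 * x + 7  # arbitrary illustrative cubic


for n in (3, 5, 7):
    # Exact agreement for a cubic, i.e. the formula is correct to 3rd differences.
    assert summation(cubic, 10, n) == summation_to_third_differences(cubic, 10, n)
```

With the misprinted coefficient $\frac{24}{n^2-1}$ the two sides would no longer agree for any of these values of n.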

Amongst these methods the name “wave-cutting” ( : vol. XLII, 111)
has been applied to certain formulae characterized
by the use of very unequal summations, resulting in a curve
for the linear compounding coefficients which has a broad flat 
top. A formula of this type first suggested by Hardy in P:50:375 
has been discussed in H 7 : vol. XLII, 131 and 109-111. Another, 
composed entirely of even summations, has been deduced by 
Vaughan, and illustrated, in P:f 47" -434-440. The whole problem 
is also discussed further in P:7 4 ^-477-487. 

5. (i) While the investigations covered by the references in
the preceding paragraph dealt almost entirely with the direct 
production of summation formulae correct to 3rd or 6th differ- 
ences, it was remarked by Hardy that an increase in smoothing 
power would sometimes result from changing the summations of 
a 3rd difference formula even though the change would introduce 
a 2nd difference error (see H:f J0:69, and cf. H :5<S:277 and P:60: 
374). 

(ii) For a rough preliminary adjustment only (when it is
desired merely to obtain results in general conformity with the data),
even simpler formulae of this character, using summations
without any operand, may occasionally be useful (see P:5f : 35,
footnote, and P:f47:431).

(iii) These formulae with 2nd difference errors have also been
employed as a basis for evolving special formulae for the
graduation of colog $p_x$. Thus Spencer (H:7f;^:402-7) derived an
expression by providing that the errors due to the 2nd and 4th
differential coefficients were practically counterbalanced. This
principle was further developed, and additional formulae given,
in P:f50, and P:f4S:465-470.

6. It was observed in par. 4 that the various summation
formulae correct to 3rd or 6th differences have usually been
deduced without any attempt to secure, a priori, any specified
degree of reduction of mean square error in either the graduated
term or any order of its differences. In P:61:29, however, Hardy
suggested the possibility of selecting a convenient set of
summations, and then determining the rest of the formula (the
“operand”) to minimize the mean square error in one of the orders of
differences. This idea was investigated by Vaughan in
443-5. Neither Hardy nor Vaughan, however, seems to have
had any knowledge of De Forest's more exhaustive work.

7. Since almost all the linear compound, and all the summa- 
tion, formulae are symmetrical, it is necessary to use corres- 
ponding unsymmetrical formulae, or a formula of short range, 
or special devices, to reach certain terms (depending on the range 
of the symmetrical formula) at the beginning and end of the 
series to be graduated. In practice special devices are often 
employed. Their purposes and methods are described suffi- 
ciently in P:^P:60-61 (noting that Ackland's process is explained 
in H:^5:357, and that examples of other methods may be found 
in H:97:340 and ll:112:d7S] HJIOiQO; H:ff^:373; and 
113-117). 

8. The graduation of rates of mortality during the select 
period may be dealt with by the device noted in section (IX) 
hereafter. 

It is important to realize, both for an understanding of this
section (VII) and the next, that all the linear compounding
graduation formulae (whether they can be thrown into a
convenient “summation” form or not) effect, in reality, some
particular a priori degree of reduction in the mean square error of
the graduated term itself, and also some other a priori degree of
reduction in the mean square error of each order of differences
of that graduated term. The degrees of reduction thus attained
vary, of course, according to whether the formula is designed to
secure (a) the greatest possible reduction in the mean square
error of the graduated term, say $f'_r$, itself (accompanied by some
other degree of reduction in respect of the differences), these
being the “fitting” formulae of par. 2(i); or (b) the greatest
possible reduction in the mean square error of one of the orders of
differences, say $\Delta^z f'_r$, of the graduated term (accompanied by
some other degree of reduction in respect of the term itself),
these being the “smoothing” formulae of par. 2(ii) when j = 3
and z = 4, or those of par. 3 when j = 3 and z = 3 or 2; or (c) such
degrees of reduction (not in any case being a greatest possible
reduction) in the mean square errors of the graduated term and
its differences as may emerge fortuitously from the particular
linear compounding factors chosen, these being the formulae of
pars. 2(iii), (iv), and (v), and pars. 4, 5, and 6.

The first main defence of the formulae of category (c) lies 
in the fact (already emphasized) that it is not possible, either in 
constructing a formula a priori, or in testing a graduation a 
posteriori, to secure at the same time the best possible fit and 
the best possible smoothness. In many cases — though not all — 
it may therefore be advisable to sacrifice something by way of 
fit in order to improve the smoothness, or vice versa, and so to 
choose deliberately a formula which combines both fit and 
smoothness in reasonable proportions. The second main defence 
of all the linear compounding formulae of section (VII) is that 
each graduated value is calculated from only a limited range of 
the ungraduated data, so that in effect a redistribution is secured 
on the basis of the information supplied by the adjacent terms 
(depending on the range selected), without recourse to very dis- 
tant terms which can hardly have any proper bearing on the 
value being graduated. 
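
The a priori degrees of reduction described in this section can be computed for any set of linear compounding weights. The sketch below rests on an assumption not stated in this outline, namely that the errors of the ungraduated terms are independent with equal variance; the reduction is then measured by the ratio of the error variance after graduation to that before. The name `mean_square_error_ratio` and the weights used are hypothetical.

```python
from math import comb

import numpy as np


def mean_square_error_ratio(weights, z=0):
    """Ratio of the error variance of the z-th differences after and before graduation.

    z = 0 gives the ratio for the graduated term itself.
    """
    # Coefficients of the z-th difference operator: (-1)^i * C(z, i).
    kernel = np.array([(-1) ** i * comb(z, i) for i in range(z + 1)], dtype=float)
    # Differencing the graduated series is the same as applying the combined weights.
    combined = np.convolve(np.asarray(weights, dtype=float), kernel)
    # The ungraduated z-th differences have variance C(2z, z) times that of a term.
    return float(np.sum(combined ** 2) / comb(2 * z, z))


weights = np.array([1, 2, 3, 2, 1]) / 9  # illustrative weights only
print(mean_square_error_ratio(weights, 0), mean_square_error_ratio(weights, 3))
```

A “fitting” formula of category (a) makes the first of these ratios as small as possible for its range, and a “smoothing” formula of category (b) does the same for the ratio at the chosen order z; a formula of category (c) simply takes whatever pair of values its weights produce.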

(VIII) Graduation by the Difference-Equation Method 

1. If the whole range of the observed values, $f''_r$, extends from
r = 1 to r = v, the linear compounding formulae of section (VII)
perform their graduations by progressive applications of the same
formula over the successive partial ranges covered by the number
of terms in the formula selected (P:166:94). The problem,
however, of deriving all the best “fitting” values, according to the
least squares criterion, from a series of observed data, may
be viewed (cf. Chapter VIII) as the problem of minimizing
$\sum_{r=1}^{r=v}(f''_r - f'_r)^2$ over the whole range, when the weights are taken
as uniform; and this process is equivalent to effecting the greatest
possible reduction in the mean square error of $f'_r$ (cf. P:166: pars.
1, 22, and 23). The best “smoothing” linear compounding
formulae (which were considered in pars. 2(ii) and 3 of section (VII)
when z = 4, 3, or 2 for j = 3) likewise effect, for the selected range,
the greatest possible reduction in the mean square error of one
of the orders of differences, $\Delta^z f'_r$, and in section (III) here it was
noted that $\sum(\Delta^z f'_r)^2$ would afford a measure of the comparative
smoothness of the results. Since, however, it is impossible to
secure at the same time the maximum of fit and smoothness, it
is evident that a compromise may be obtained by combining
those desiderata in some stated proportion, so that the blended
function to be made a minimum could be taken on analogous
principles, over the whole range, as
$k\sum_{r=1}^{r=v}(f''_r - f'_r)^2 + \sum_{r=1}^{r=v-z}(\Delta^z f'_r)^2$,
where k is the arbitrary proportion, and the upper limit for the
smoothing term is r = v - z because $\Delta^z$ is not available for higher
values in a series ending at r = v. [A mathematical derivation of
this function, by employing the principles of probability and
Bayes’ formula (43b) of p. 222; B; 12, has been given by
Whittaker in P:155:304-6.]

2. Since this function to be minimized is thus only a
combination and extension of the principles already dealt with in
linear compounding, it will be clear that the result must be
expressible as a linear compounding formula, and that it will
cover the whole range from r = 1 to r = v instead of a limited
range only. This relation between the methods of section (VII)
and those to be now considered here is important, as will appear.
It may indeed be useful, at this stage, to remark that the shape
of the curve taken by the linear compounding coefficients for this
treatment over the whole range resembles closely that stated by
De Forest for the formulae of limited range (par. 2(iii) of section
(VII) here, and P:166: par. 32), with the numerous small
undulations at each end which he also observed when any linear
compounding formula is repeated frequently (par. 2(v) of section
(VII) here, and P:166: par. 34, footnote).

3. The relationship pointed out in the preceding paragraph
may be seen readily by minimizing the expression of par. 1, and
then following the method of solution first shown by Aitken
and demonstrated very clearly by Spoerl (P:154:423-5).
The minimizing is effected easily by differentiating with regard
to each of the unknown values, $f'_x$, in turn, and equating to zero,
remembering, since all the $f'_x$'s are independent, that
$\frac{d}{df'_x}(f'_x) = 1$, and that $\frac{d}{df'_x}(f'_{x+t}) = 0$ when $t \neq 0$.
Expanding the expression as
$k\sum\left[(f''_r)^2 - 2f''_r f'_r + (f'_r)^2\right] + \sum\left[f'_{r+z} - z f'_{r+z-1} + \cdots + (-1)^z f'_r\right]^2$,
and for differentiation with regard to $f'_x$ therefore discarding all
terms not involving $f'_x$, the v equations for the solution of the
v unknown $f'_r$'s emerge at once (as shown in slightly different
notation, for z = 3, at P:154:404 and P:^^:300-1, noting that in
lines 3 and 4 at the commencement of the latter discussion on
p. 300 the $\sum$ should cover both the terms, and that in the second
line of formula (2) on p. 301 the subscripts of the last two terms
should be a + 1 and a). Except for the first z equations for
$f'_1, f'_2, \ldots,$ and $f'_z$, and likewise the last z at the other end, all the
equations are of the same form $k(f'_r - f''_r) + (-1)^z \Delta^{2z} f'_{r-z} = 0$;
and the first z and the last z differ only in the successive omission
of certain differences. The problem now is to solve these v
equations, and so to find the unknown graduated $f'_r$'s.
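
With a modern matrix routine the v equations can be solved all at once. The following is a minimal sketch under stated assumptions, and is not one of the methods of solution reviewed in this section: it takes the function to be minimized as k times the sum of squared departures plus the sum of squared z-th differences of the graduated values, writes the observed values as `observed`, and uses NumPy; the function names and the data are hypothetical.

```python
import numpy as np


def difference_matrix(v, z):
    """Matrix of shape (v - z, v) which forms the z-th forward differences of a series."""
    d = np.eye(v)
    for _ in range(z):
        d = d[1:] - d[:-1]
    return d


def graduate(observed, k, z=3):
    """Solve the v linear equations obtained by equating each derivative to zero."""
    observed = np.asarray(observed, dtype=float)
    d = difference_matrix(len(observed), z)
    # Equations: k * (graduated - observed) + D'D graduated = 0.
    return np.linalg.solve(k * np.eye(len(observed)) + d.T @ d, k * observed)


rng = np.random.default_rng(0)
ages = np.arange(20)
observed = 0.001 * 1.09 ** ages + rng.normal(0.0, 0.0003, size=20)  # made-up data
graduated = graduate(observed, k=0.01, z=3)
# Away from the first 3 and last 3 terms each equation has the common form
# k * (graduated_r - observed_r) - (6th difference of graduated at r - 3) = 0.
interior = 0.01 * (graduated[3:-3] - observed[3:-3]) - np.diff(graduated, n=6)
print(np.max(np.abs(interior)))
```

The rows of the matrix for the first z and last z terms automatically omit the differences that are not available, which corresponds to the successive omission noted above.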

4. In seeking this solution the first point to be noticed is that
if, instead of covering only values from 1 to v, the series were
of indefinite length from $-\infty$ to $+\infty$, so that the equation
$k(f'_r - f''_r) + (-1)^z \Delta^{2z} f'_{r-z} = 0$ held unchanged throughout, the
expression to be minimized would be
$k\sum_{r=-\infty}^{r=+\infty}(f''_r - f'_r)^2 + \sum_{r=-\infty}^{r=+\infty}(\Delta^z f'_r)^2$;
and this would reduce to the expression for the range 1 to v if all
the values before r = 1 and after r = v were zero, that is, if $f''_r = f'_r$
and $\Delta^z f'_r = 0$ for r < 1 or > v. But when $f''_r = f'_r$, it follows that
$\Delta^z f''_r = \Delta^z f'_r$; it is therefore only necessary at each end to supply
new ungraduated terms, such that $\Delta^z f''_r = 0$. This means that
at each end additional ungraduated terms should be supplied
with their values of $\Delta^{z-1} f''_r$ equal. Some method must therefore
be settled for annexing those terms, and then completing the
solution for all the v values of $f'_r$ required.


5. The problem has been examined in a number of papers, of 
which the following brief chronological account will give the 
necessary references. 

(i) The first statement of the whole principle here under
discussion was given by E. T. Whittaker (see also P:155:303).
An approximate solution was suggested at that time; but
it may now be discarded as greatly improved methods have been
developed (cf. P:;^>^:24).

(ii) A complete theoretical solution was published next, again
by Whittaker (H :14 ^); but it also may be discarded on account of
the extremely unwieldy figures which it involved (cf. P:24-18).

(iii) A method when z ≯ 3 was then given by Henderson
which depended on resolving the difference equation (of
par. 3 here) into two factors. The final result is thus reached
by constructing an “intermediate series”, the calculations being
performed in two steps. The process can be applied for any
value of k, although the factors are more easily handled when k
takes certain special forms. When z = 3, for example, the
difference equation for all but the first 3 and last 3 terms is
$k(f'_r - f''_r) - \Delta^6 f'_{r-3} = 0$, which may be written
$\left[1 - \frac{\Delta^6 E^{-3}}{k}\right] f'_r = f''_r$,
where $E = 1 + \Delta$; and when k takes the form
$\frac{16(2n+3)^2}{n(n+1)^3(n+2)^3(n+3)}$
the operator $\left[1 - \frac{n(n+1)^3(n+2)^3(n+3)}{16(2n+3)^2}\,\Delta^6 E^{-3}\right]$ can be resolved
into the two cubic factors
$\left[1 - n\Delta + \frac{n(n+1)}{2}\Delta^2 - \frac{n(n+1)^2(n+2)}{4(2n+3)}\Delta^3\right]$
and
$\left[1 + n\Delta E^{-1} + \frac{n(n+1)}{2}\Delta^2 E^{-2} + \frac{n(n+1)^2(n+2)}{4(2n+3)}\Delta^3 E^{-3}\right]$.
In order to start the process the terms at the beginning are
first estimated by using constant values of $\Delta^{z-1} f''_r$; those at the end
are derived from the results obtained from the first step. The
problem of estimating approximate terms at the beginning, in
the manner originally suggested by Henderson, is discussed in
P:5^:36-38, P:^;^:7-10 (see also P:(?f :516-517), and with a
complete numerical example (for z = 3 and k = .009) in H:171:151-166.
A numerical illustration where z = 2 and k = 1/3 is also set out in
P:^;^:9-10.

More recently the correction of these initial terms has been
examined in P:154:409-413, P:55:51, and P:P:52, and a
mathematical summary of a method proposed by Henderson is stated
in P:59:28, par. 4.2.

A useful résumé of the 2nd difference case (z = 2) when n = 1
(k = 1/3) or n = 3 (k = 1/60), and of the 3rd difference case (z = 3) when
n = 1 (k = 25/54) or n = 3 (k = .009), is given in P:77:512-514, with a
convenient form of card for performing the calculations.

[Note that when z = 2 the difference equation is
$k(f'_r - f''_r) + \Delta^4 f'_{r-2} = 0$, or $f''_r = f'_r + \frac{1}{k}\Delta^4 f'_{r-2}$,
and when z = 3 it is
$k(f'_r - f''_r) - \Delta^6 f'_{r-3} = 0$, or $f''_r = f'_r - \frac{1}{k}\Delta^6 f'_{r-3}$.]

This method is sometimes referred to as the “Whittaker-Henderson
Formula A”, from Professor Whittaker’s first enunciation
of the principles and Henderson’s subsequent solution.
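
The resolution of the operator into two cubic factors when z = 3 can be verified by multiplying out polynomials in E. The sketch below is a check under stated assumptions rather than a transcription of Henderson's working: it takes $k = 16(2n+3)^2 / [n(n+1)^3(n+2)^3(n+3)]$, and it assumes factors of the form $1 - n\Delta + b\Delta^2 - c\Delta^3$ and $1 + n\Delta E^{-1} + b\Delta^2E^{-2} + c\Delta^3E^{-3}$ with $b = n(n+1)/2$ and $c = n(n+1)^2(n+2)/[4(2n+3)]$; the function name is hypothetical.

```python
from functools import reduce

import numpy as np
from numpy.polynomial import polynomial as P


def henderson_factor_check(n):
    """Return k, the product of the two assumed factors, and the target operator.

    All operators are held as polynomials in E (ascending powers); the backward
    factor and the target are multiplied through by E^3 to clear negative powers.
    """
    k = 16 * (2 * n + 3) ** 2 / (n * (n + 1) ** 3 * (n + 2) ** 3 * (n + 3))
    b = n * (n + 1) / 2
    c = n * (n + 1) ** 2 * (n + 2) / (4 * (2 * n + 3))
    delta = np.array([-1.0, 1.0])  # Delta = E - 1
    shift = np.array([0.0, 1.0])   # E itself

    def d(power):
        return P.polypow(delta, power)

    def e(power):
        return P.polypow(shift, power)

    def total(*polys):
        return reduce(P.polyadd, polys)

    forward = total(e(0), -n * d(1), b * d(2), -c * d(3))
    backward = total(e(3), n * P.polymul(d(1), e(2)), b * P.polymul(d(2), e(1)), c * d(3))
    product = P.polymul(forward, backward)
    target = P.polysub(e(3), d(6) / k)  # E^3 * (1 - Delta^6 E^-3 / k)
    return k, product, target


for n in (1, 1.5, 2, 2.5, 3, 3.5):
    k, product, target = henderson_factor_check(n)
    print(n, round(k, 6), np.allclose(product, target))
```

The values of k printed for n = 1, 1.5, 2, 2.5, 3, and 3.5 are the special values .46296, .12738, .04537, .01906, .009, and .004639 which recur in the tables cited in this section.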

(iv) In applying this difference-equation method to the
graduation of mortality rates, Henderson (P:57) has suggested that,
instead of using the rates of mortality in the fitting portion of
the function by taking $f''_r = q''_r$ and $f'_r = q'_r$, it may be advisable to
make allowance for the extent of the data at each age by
weighting those rates approximately with $E_r$, so that the fitting portion,
$\sum(f''_r - f'_r)^2$, might be taken as
$\sum\left[E_r(f''_r - f'_r)^2\right] = \sum\left[E_r(q''_r - q'_r)^2\right]$.
Moreover, when z = 3 the effect of the smoothing term is to make
the graduated values close to a series with constant 2nd
differences, whereas at the younger ages the 2nd differences should
increase; and this discrepancy might be corrected by adding
$.1(\Delta^2 q'_1)^2$ at the youngest age. Under these particular conditions
the whole blended function to be minimized would become
$k\sum\left[E_r(q''_r - q'_r)^2\right] + \sum(\Delta^3 q'_r)^2 + .1(\Delta^2 q'_1)^2$.

By differentiating with regard to the unknown $q'_r$'s and equating
to zero, as before, a set of simple equations equal in number
to the unknowns emerges as stated in P:59:40 (see also P:JSS:61).

This method is sometimes referred to as the “Whittaker-Henderson
Formula B”. The form of calculation is indicated
in P:59:41. While it avoids the difficulty with the initial terms,
the arithmetical work is heavy. The selection of k may be made
from the fact that in the Formula A method (par. (iii) preceding)
k = .01 usually gives a satisfactory graduation, and since this
factor is replaced by $kE_r$ in the B method it follows that $kE_r$
should be about .01 over the range of ages where the data are
heaviest (P:f 15:289); or k may be taken as inversely proportional
to the square root of the maximum or average $E_r$ (P:55:43).
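
The simple equations of the Formula B method can likewise be solved directly by machine. The sketch below is an illustration only: it assumes that the weights $E_r$ are exposures, that the function minimized is $k\sum E_r(q''_r - q'_r)^2 + \sum(\Delta^3 q'_r)^2$ with the optional extra term $.1(\Delta^2 q'_1)^2$, and the names `graduate_weighted` and `initial_weight`, together with all the figures, are hypothetical.

```python
import numpy as np


def difference_matrix(v, z):
    """Matrix of shape (v - z, v) which forms the z-th forward differences of a series."""
    d = np.eye(v)
    for _ in range(z):
        d = d[1:] - d[:-1]
    return d


def graduate_weighted(rates, exposures, k, z=3, initial_weight=0.0):
    """Weighted graduation: the factor k of Formula A is replaced by k * E_r."""
    rates = np.asarray(rates, dtype=float)
    weights = k * np.asarray(exposures, dtype=float)
    v = len(rates)
    d = difference_matrix(v, z)
    system = np.diag(weights) + d.T @ d
    if initial_weight:
        # Extra term initial_weight * (second difference at the youngest age)^2.
        first = difference_matrix(v, 2)[:1]
        system = system + initial_weight * first.T @ first
    return np.linalg.solve(system, weights * rates)


exposures = np.array([300, 500, 800, 1200, 1500, 1400, 1000, 700, 400, 200], dtype=float)
rates = np.array([0.0021, 0.0019, 0.0024, 0.0023, 0.0027, 0.0030, 0.0029, 0.0036, 0.0035, 0.0044])
graduated = graduate_weighted(rates, exposures, k=0.01 / 1500, initial_weight=0.1)
print(graduated)
print(np.sum(exposures * graduated), np.sum(exposures * rates))
```

Here k is chosen so that $kE_r$ is .01 at the age of heaviest exposure, following the rule quoted above. Because every difference of a constant series vanishes, the equations imply that the total of $E_r q'_r$ equals the total of $E_r q''_r$, so that under the exposure assumption the expected and actual deaths agree in total.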

(v) At almost the same time as the appearance of Henderson’s
solution, Aitken (as already noted in par. 3) examined the
difference equation in linear compound form (H:152). When z = 3, for
example, the equation (see par. 3) is $k f'_r - k f''_r = \Delta^6 f'_{r-3}$, or
$k f''_r = (k - \Delta^6 E^{-3}) f'_r$, where $E = 1 + \Delta$,
whence $f'_r = \left(1 - \frac{\Delta^6 E^{-3}}{k}\right)^{-1} f''_r$,
and the linear compound follows immediately by expansion (see
also P:154:418, 443, and 517). Although these linear compounding
coefficients, which are symmetrical, evidently cover the whole
range from r = 1 to r = v and also continue indefinitely but
diminish rapidly beyond each end (instead of dealing only with a
limited range as in the formulae of section (VII) here) it is to be
noted that they follow closely the curve first indicated by
De Forest for the case of repetitions of a limited-range formula
(see par. 2).

Aitken calculated the linear compounding coefficients when
z = 3 for certain values of k (being $\epsilon$ in his and Whittaker’s
notation), which for k = .01, .02, .05, .1, .25, and 1 are given also in
P:1:34. When z = 3, values for other convenient special cases
k = .46296, .12738, .04537, .01906, .009, and .004639 are to be
found (in the columns headed kf) in P:154:457-462. When z = 2
the coefficients for k = 1/3 are worked out similarly in P:7;^:516
and 517.

Aitken further established the important and simple fact that
an accurate solution may be reached by extending the
ungraduated data, $f''_r$, for (z + 1) terms at each end (in order to determine
the required terminal zero values of $\Delta^z f''_r$) by the use of a set of
unsymmetrical coefficients (depending on the value of k), so that
at each end the constant values of $\Delta^{z-1} f''_r$ given by these extended
terms may then be used to build up easily as many more extended
terms as may be required to permit the application of the
symmetrical linear compounding form over the whole range of the
data. The method, and the unsymmetrical coefficients for the
extensions at each end when z = 3 and k = .01, .02, .05, .1, .25, or 1,
are given clearly and accessibly by Aitken in P:1:31-36; the
mathematical analysis, and the coefficients for the extensions
when z = 3 and k has the particular values .46296, .12738, etc.,
noted above, are available conveniently in P:154:423-5 and
457-462. A complete numerical illustration may be found in P:1:34-36.
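
The symmetrical coefficients of the linear compound form may be approximated numerically. The sketch below does not follow Aitken's expansion: it swaps in a discrete Fourier transform on a long circular grid, assuming the operator to be inverted is $1 + (-1)^z\Delta^{2z}E^{-z}/k$, whose value at $E = e^{i\theta}$ is $1 + (2\sin\frac{\theta}{2})^{2z}/k$. The function name, grid size, and half-width are hypothetical choices.

```python
import numpy as np


def whole_range_coefficients(k, z=3, half_width=15, grid=4096):
    """Central 2 * half_width + 1 symmetrical coefficients of the inverse operator."""
    theta = 2.0 * np.pi * np.arange(grid) / grid
    # Reciprocal of the operator evaluated on the unit circle.
    response = k / (k + (2.0 * np.sin(theta / 2.0)) ** (2 * z))
    coefficients = np.fft.ifft(response).real
    # Re-centre: indices -half_width, ..., 0, ..., +half_width.
    return np.concatenate([coefficients[-half_width:], coefficients[: half_width + 1]])


w = whole_range_coefficients(k=0.01)
unit = np.zeros_like(w)
unit[len(w) // 2] = 1.0
# The coefficients satisfy k * (w_j - unit_j) - (6th difference of w at j - 3) = 0.
residual = 0.01 * (w[3:-3] - unit[3:-3]) - np.diff(w, n=6)
print(np.round(w[len(w) // 2:], 5), np.max(np.abs(residual)))
```

The printed half of the curve shows the central hump followed by small alternating undulations that die away, which is the shape described in the text.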


(vi) From the preceding development of the linear compound 
form it is clear that a closely approximate solution will be reached 
by using only the significant portion of the expansion. Davidson 
and Reid (P:^4) accordingly investigated the formulae of 

^ k — / 

1 -| — \ , and throwing the linear compound so 


retained into the “summation” form of par. 4, section (VII) here.
They gave (loc. cit., 6) both the linear compound coefficients and
the equivalent summation formulae for 17, 21, and 25 terms with
k = .01, .05, .06, .1, and .2, and also (loc. cit., 12) summation
formulae of 15, 17, 21, 23, and 27 terms when k = .02, .05, .1, .3,
and .8.


(vii) Finally, an excellently compact method of solution,
which also has the great merit of accuracy and simplicity, has
been evolved by Spoerl (P:154:414-6). He gives a systematic and
adequate process of finding z additional terms at each end by
applying Aitken's unsymmetrical linear compounding
multipliers, and then computing any graduated term directly and very
easily from the original data and the 2z added terms only, by
applying the linear compound formula to the original series of
data, and also z special multipliers to the z added terms at each
end. A clear summary of the method when z = 2 is given by
Spoerl at P:i«:616.

The mathematical analysis of Spoerl’s formulae is treated
fully in P:154:425-7 and 443-9. The multipliers required in the
work are tabulated in P:154:456-462 for z = 3 and k = .46296,
.12738, .04537, .01906, .009, and .004639, and at P:i3:516 for z = 2
and k = 1/3.

Spoerl's process is considerably shorter than Aitken's
procedure of par. (v), since it requires at each end the calculation
and use of only z added terms, whereas Aitken’s method employs
a long extension of the original data depending on the value of k.
Spoerl's method, moreover, has the very important practical
advantage that when the 2z additional terms have been
computed, any desired graduated values at selected points (such as
every 5th or 10th value) can be found with the greatest ease in
order to see whether the graduation is likely to be satisfactory
with the particular z and k adopted.

(IX) The Graduation of “Select” Mortality Tables

When it is desired to construct a “select” mortality table, in
which the rates of mortality will be tabulated for a certain
number of years of duration, t, since entry at select age [x], the
problem of graduation is to adjust the table consistently for both
the variables, and to run the values of the select period smoothly
into the “ultimate” rates. Using the standard notation already
stated on p. 296; C; 11, the following special methods have been
employed on various occasions.

(a) With Makeham's Formula

Since the property of “uniform seniority” (see p. 319; C; 18)
in Makeham's formula (83) depends only on the constant c,
G. F. Hardy preserved its applicability to select tables (H:7S:
359; H:7<?:493; H:5>0:126, 134-5, and 158; H:5/5:508) by writing
$\mu_{[x]+t} = A + f(t) + [B + \varphi(t)]c^{x+t} =$ (say) $A_t + B_t c^{x+t}$, or
correspondingly $\operatorname{colog}_{10} p_{[x]+t} = \alpha_t + \beta_t c^{x+t}$. Treating separately the data for
each year of duration, t, the values of $A_t$ and $B_t$ (or $\alpha_t$ and $\beta_t$) are
first determined by one of the fitting methods of Chapter VIII.
These values, however, will generally require some further
adjustment in order to secure consistently smooth junctions between
the select and ultimate portions of the tables. The shape of the
curves is shown at P:51:74-5; the particular expressions used by
Hardy for their representation are stated in H:90:135 and 158;
H:^^:508; and P:51:76-7. His methods of determining the
constants therein are referred to in H:P0:127 and 157-161, and
H:55:507, and are explained more clearly in H:105:292-5 and
322-3.

Hardy based much of his analysis on the fact that when
$\mu_{x+t} = A + Bc^{x+t}$, and $\mu_{[x]+t} = A + f(t) + [B + \varphi(t)]c^{x+t}$, it follows
immediately by integration that

logio /(*!+» = logio /*+» - , - +C*F'(0]. 

log.lO 

Similarly in terms of the constants of colog p the formula may 
be written logio/[*i+e = logio/x+«““/*’“i^^®^«- It should be noted 
that the relation /3i = [l— 2n(10— /)c~*]i3 given in P:^7:77 ought 
to be fit — [l—n{19 — 2t)c‘^^]fii as pointed out in H:n^:474. 

(b) With the Graphic Method

(i) The process adopted originally by T. B. Sprague for the
$H^{M}$ data was firstly to graduate $q_{[x]}$. For the first year of
duration he then graduated the unadjusted $\frac{q_{[x]+1}}{q_{[x]}}$ for each value of x,
and multiplied the graduated value of this ratio by the
graduated $q_{[x]}$ to obtain the required value of $q_{[x]+1}$. A similar method
was used for each subsequent year of duration (H:^/ :244-6).

(ii) An alternative method used by J. Chatham in an
experimental adjustment of the British Offices' Annuitants' Experience
was to graduate $q_{[x]}$ first, and then for each value of x to graduate
the series $q_{[x]}$ graduated, $q_{[x]+1}, q_{[x]+2}, \ldots$ ungraduated.

(iii) As noted at p. 296; C; 11 here, Jastremsky concluded in
1912 from an investigation of Austro-Hungarian mortality that
the ratio $\frac{q_{[x-t]+t}}{q_x}$ might be considered to be independent of the
attained age x, which means that $q_{[x]+t} = k_t\, q_{x+t}$ where $k_t$ depends
only on the duration t, and n denotes the number of years in the
select period. In the Japanese Three Offices' Tables (see P:7>^:
103) it was similarly thought justifiable to assume that $q_{[x]} = .62\,q_x$,
$q_{[x]+1} = .875\,q_{x+1}$, $q_{[x]+2} = .95\,q_{x+2}$, and $q_{[x]+3} = .97\,q_{x+3}$, and
that $q_{[x]+t} = q_{x+t}$ when t = 4 or more. The rates during the 10-year
select period of the $O^{[M]}$ experience, and in other data also,
likewise approximate closely to the same type of relationship except
for the first year of duration (P:7<9:332 and 336). George King
therefore suggested (P:7<^: 289-292, et seq.) that a short and
often satisfactory method of determining the graduated values
during the select period would be simply to apply such factors,
$k_t$, to the graduated ultimate rates, the values of $k_t$ being
adjusted, graphically or otherwise, as might seem necessary.

(iv) Another ratio process (P:i^P:136) employed in the 1926 
Life Tables of the German Life Assurance Companies was first 
to graduate the ultimate rates of mortality, q^, and also q[x]; 
then the data for each separate year of duration were divided into 
five sections, each embracing 9 ages attained, from age 20 to 
age 64, and interpolation factors where a denotes the 

number of ages in the group, were computed as 

a a 

•VO _ 2 E,[x-t\ +* 2* ~2 E,<x-t]+t 2(*— n+« . 

A (*-1)+* 1 , 

2 ■£[*-«]+« 3 »— 2 q[x] 

these factors were next graduated horizontally and vertically; 
and finally they were calculated for each age, after the average 
age to which they applied had been determined by weighting. 
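
The factor method of (iii) amounts to a single multiplication for each year of duration, as the following minimal sketch shows. The factors are those quoted above for the Japanese Three Offices' Tables; the ultimate rates and the function name `select_rates` are hypothetical and serve for illustration only.

```python
def select_rates(ultimate, entry_age, factors):
    """Rates q_[x]+t for t = 0, 1, ..., len(factors) - 1, taken as k_t * q_(x+t)."""
    return [k_t * ultimate[entry_age + t] for t, k_t in enumerate(factors)]


# Factors k_t quoted in the text for durations t = 0 to 3.
factors = [0.62, 0.875, 0.95, 0.97]
# Hypothetical graduated ultimate rates, for illustration only.
ultimate = {30: 0.00200, 31: 0.00210, 32: 0.00222, 33: 0.00236}
print(select_rates(ultimate, 30, factors))
```

For durations of 4 years or more the factor is taken as unity, so that the select rates run into the ultimate rates.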





(c) With Finite Difference Interpolation Methods

The use of elementary finite difference formulae to obtain
smooth junctions between the various sections of the data was
discussed by George King (P:7^a:108 et seq.; see also P:1 ^7:247)
as a development of his method already noted in (b)(iii).

(d) With Linear Compounding (or Summation) Methods

A useful device for the graduation of select tables by these
methods is described conveniently in P:5S:62.

(e) With Difference-Equation Methods

A process based on the simplest form of the Whittaker-Henderson
A method is given by Wells in P:153.

(X) Conclusion 

From the number and variety of the methods and formulae 
noted in this outline the student will readily appreciate that there 
is no “royal road” to knowledge and proficiency in the field of 
graduation. The selection of the process to be tested in any 
particular case will depend partly on the character and extent of 
the material, which obviously may eliminate certain procedures 
as inapplicable or clearly inadvisable; the method should be 
determined also with due regard for the practical situations in 
which the graduated results will be employed; and it is not 
scientifically inappropriate to add that the personal preference 
of the graduator may be allowed to have some influence upon 
his choice. All these considerations, however, must in the end 
be circumscribed by the requirement that the numerical results 
must satisfy those tests of fit and smoothness which shall be 
thought suitable. 

The student accordingly should not accept any argument — 
however persuasive — or any opinion — from whatever source — 
which claims preeminence or universality for any special process. 
No single method has yet been proposed which can rightly claim 
an ability to reach a best, or a satisfactory, result under all cir- 
cumstances, or even under most; and no such method is likely 
to be proposed. The student of graduation should therefore be 
very careful to preserve an impartial and wide outlook, so that 
under any set of practical conditions he may select, apply, and 
test those methods which, on close consideration, appear to him 
most suitable for dealing with the case in hand. 



SECTION 

A 

HISTORY 




A; 1. Stirling’s Formula 

Stirling’s formula (9) was published in 1730 (H:0:137). 
De Moivre in 1718, in the first edition of his “Doctrine of 
Chances” (H:5), nearly reached the formula; in his second 
edition (1738) he credited the completed expression to Stirling, 
who had obtained it by using Wallis’ formula (P:80:67, and 
P:36:90), which gives a relation, in the limit, between $\pi$ and 
factorials. The usual proof, which is lengthy, need not be 
reprinted here; it may be found in P:155:138, P:80:65, or 
P:146:349, while a shorter demonstration by Cesàro, based on 
inequalities, is given in P:5^:93 (see also 

Since the relative error involved by dropping the term $\frac{1}{12n}$ 
is only about $\frac{8}{n}$ per cent, the formula is very generally used 
simply as $n! \approx \sqrt{2\pi n}\, n^{n} e^{-n}$. This elegant expression shows 
remarkably accurate results, even for small values of $n$; for 
example, its value when $n = 5$ is 118.0, whereas 5! = 120; when 
$n = 10$ it gives 3,598,696 in comparison with 10! = 3,628,800. 
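
The figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `stirling` and `relative_error_percent` are assumptions introduced here, and the shortened form $\sqrt{2\pi n}\, n^{n} e^{-n}$ is taken as the expression being tested.

```python
import math

def stirling(n: int) -> float:
    """Shortened Stirling approximation sqrt(2*pi*n) * n^n * e^(-n)."""
    return math.sqrt(2 * math.pi * n) * n**n * math.exp(-n)

def relative_error_percent(n: int) -> float:
    """Percentage by which the approximation falls short of n!."""
    exact = math.factorial(n)
    return 100 * (exact - stirling(n)) / exact

for n in (5, 10):
    # Columns: n, approximation, exact factorial, actual error %, 100/(12n) %
    print(n, round(stirling(n), 1), math.factorial(n),
          round(relative_error_percent(n), 3), round(100 / (12 * n), 3))
```

The last two printed columns should be close to one another, which is the sense in which the dropped term $\frac{1}{12n}$ measures the relative error.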


A; 2. The Discovery of the Normal Curve 

Until recently the names of Bernoulli, Laplace, and Gauss, 
either separately or in combination, have been associated with 
the functions (10) and (11). In 1924, however, Karl Pearson 
(P:PP) announced his discovery of De Moivre’s remarkable 
“Approximatio ad Summam Terminorum Binomii $(a+b)^n$ in 
Seriem Expansi” (H:7), which was published in 1733. In that 
work — of which two copies only are extant — the first statement 
of the Normal Curve is clearly given (see P:f 4^:13-18, P:116: 
47, and H:164:566 for the complete English version). 


A; 3. History and “Proofs” of the Normal Curve 

This Normal Curve, and the Theory of Errors and Method 
of Least Squares (to be discussed later) which resulted from it, 
presented an effective process (highly systematized by Gauss 
especially, and by astronomers and physicists in several coun- 
tries) for dealing with the “errors” of unbiassed observations 
made by precision instruments and skilled observers. The technique 
so developed still has that same utility. 

In view of this wide use in an era of expanding mathematical 
research, it is not surprising that the genesis and symmetry of 
the curve led for many years to a search for other proofs which 
might release the derivation from any dependence upon the a 
priori supposition of repeated trials, and would instead permit 
the formulation to be based upon the concept of “errors of 
observation”, measured from an average or mean, as such 
“errors” might occur in nature. Some of the “proofs” were 
quite unsatisfactory, others were more rigorous; all, however, are 
interesting still in the history of thought, and should not be 
treated lightly by any student who wishes to grasp thoroughly 
the full significance of the development of more recent methods. 

The search for a convincing “proof” was accompanied often 
by the belief that all mass phenomena would show variations 
occurring symmetrically about a mean, and that the Curve of 
Error (11) was the expression of that fundamental law of nature. 
It was not realized for many years that the quest for uniformities 
in nature is not necessarily predicated best on an essential sym- 
metry; nor was it appreciated fully that the Normal Curve would 
fail as a mode of representing many types of chance distributions. 
When, however, attempts were made, by Quetelet and others, to 
apply the method to problems of biology and sociology, it was 
found that the statistics frequently exhibited a defiance of the 
“Normal Law”. At first this was thought to be merely the result 
of paucity of data; but later it was recognized to indicate the 
existence of skew variation even in ample and homogeneous 
material. When these facts did finally appear they led quickly 
to the development of unsymmetrical (skew) curves in addition 
to the elegant symmetry of the “Normal Law”, and thus created 
a much wider and more general theory (see also P:jf 4^:17 and 
28-49, and F:S6:U9 and 179-181). 

The very great importance which was thus attached to the 
discovery of the “Normal Law”, and the numerous attempts to 
place it upon a logical foundation, are illustrated by the following 
summary of the “proofs”, “explanations”, and discussions which 
were published by many of the greatest mathematicians between 
1733 and 1872. 

(i) In 1733 the original discovery was made by De Moivre 
(see p. 151 ; A; 2) although it does not seem to have attracted any 
attention, and later investigators proceeded without any know- 
ledge of it. 

(ii) In 1778, Laplace appears to have recognized the existence 
of the formula in an evaluation of $\int_0^\infty e^{-t^2}\,dt$ (see P:145:18 and 20), 
and in his later work showed clearly that he had anticipated 
Gauss (P:80:8). His method (see the English translation 
in H:164:588), which was based largely on the view that an 
“error” may be supposed to be produced by the combination of 
a vast number of very small independent “elementary errors”, 
and on the principle of minimizing the mean value of the “error” 
thus committed, was improved and extended by many subsequent 
writers, notably Poisson (H:19) and Ellis (H:26), and 
is well re-stated by Airy (H:Sj^:7), Glaisher (H:46), Todhunter 
(H:SS:464), Whittaker and Robinson (P:7J^:168-173), and partly 
by Arne Fisher (P:S^:197). Notwithstanding their profoundly 
important character, to which all later investigators, to this day, 
pay tribute, the great difficulty of mastering Laplace’s intricate 
analyses caused his methods for many years to be largely superseded 
by the less imaginative procedure of the German Gauss. 

(iii) Robert Adrain in 1808 was led independently to its 
discovery. The two deductions which he gave (H:13), however, 
cannot be considered satisfactory (see H:46, and P:145:20 and 94). 
The second was essentially that given much later by Herschel 
(see (vii) here), and is open to the same question with regard to 
the validity of assuming the probabilities of the $x$ and $y$ deviations 
as independent (H:5S). 

(iv) In 1809, Gauss, acknowledging his indebtedness to 
Laplace (see P:149:22), published his first “Theoria Motus” 
(H:14, §177) derivation (for which see P:156:218, P:^0:118-120, 
and P:13:22), involving the “postulate of the arithmetic mean”, 
i.e., that when any number of equally good direct observations 
of an unknown quantity $x$ have been made, the “most probable” 
value is their arithmetic mean. 

The precise extent to which Gauss really depended on this 
postulate, and the postulate itself, have been subjected to critical 
discussion for many years. In 1834, Encke attempted unsuccessfully 
to prove it. In 1844, Ellis, supposing that $a$ is the true value, 
and $x_1, x_2, \ldots, x_n$ the $n$ observed values with errors 
$e_1, e_2, \ldots, e_n$, so that $x_1 - a = e_1$, etc., pointed out 
that the rule of the arithmetic mean is easily deducible from very 
simple a priori considerations; for if, in the long run, there is no 
permanent cause tending to make the sum of the positive differ 
from that of the negative errors, then $\Sigma e = 0$, that is 
$\Sigma(x - a) = 0$, or $a = \frac{1}{n}\Sigma x$, namely, the arithmetic mean. 
But he also drew attention to the important fact that these 
suppositions are in reality too simple; for, instead of only 
$\Sigma e = 0$, we could have $\Sigma f(e) = 0$ where $f(e)$ is any function 
such that $f(e) = -f(-e)$, that is to say we could have 
$\Sigma f(x - a) = 0$, “and no satisfactory 
reason can be assigned why . . . the rule of the arithmetic mean 
should be singled out from the other rules which are included in 
the general equation $\Sigma f(x - a) = 0$”, for “we are perfectly sure 
that in different classes of observations the law of probability 
of error must vary”. He laid stress also on the ambiguity of 
the words “most probable” in Gauss’ treatment, remarking that 
“there is no reason for supposing that because the arithmetic 
mean would give the true result if the number of observations 
were increased without limit, it must give the ‘most probable’ 
result, the number of observations being finite”, for “by losing 
sight of this distinction we are led to the inadmissible conclusion 
that a principle recognized as true a priori necessarily implies a 
result, viz., the universal existence of a special law of error, [which 
is] not only not true a priori, but not true at all” (cf. also his 
further remarks in H:27). 

Glaisher remarked in 1872 that “what was in effect Gauss’ 
view, viz., that the arithmetic mean is practically the best mode 
of combining simple observations . . . , was quite reasonable and 
consistent, but he was very far from asserting that the arithmetic 
mean is the most probable value of the quantity observed” (H:46). 





More recently the Italian mathematicians have re-examined 
the axiomatic foundations from which the postulate may be 
deduced (see P:155:215-7), while for certain types of observations 
it has been shown to be invalid (see P:155:217-8). It must be 
added that Gauss himself did not present his method as “other 
than tentative and hypothetical” (H:^^), and that subsequently 
he expressed a preference for his second proof in (v) below (see 
P:155:224 and 228). 

(v) Although it concerned more directly the establishment 
of the “Method of Least Squares”, Gauss’ “Theoria Combinationis” 
(H:17) investigation must be included here. Following 
largely the methods of Laplace, and thus abandoning the necessity 
of introducing the “postulate of the arithmetic mean”, he 
based his treatment on the principle of minimizing the probable 
value of the square of the error (instead of Laplace’s postulate 
that the importance of the error is proportional simply to its 
magnitude); see H:26, and P:155:227-8. In illustration of the 
various opinions which these several treatments have evoked, 
it may be noted that Ellis (H:26) observed that “nothing can 
be simpler or more satisfactory” than Gauss’ “Theoria Combinationis” 
demonstration; Glaisher, however, differed, saying 
(H:46) “with this remark I cannot at all agree”; Merriman 
(H:5S) and others have expressed the opinion that, in its relation 
to the Method of Least Squares, “it is but little more than 
a begging of the question to assume that the mean of the squares 
of the errors is a measure of precision”; and Crofton (H:42:183) 
stated that “it is of infinitesimal importance whether, with 
Laplace, we estimate the importance of an error by its mean 
value (irrespective of sign), or, with Gauss, by its mean square”, 
for both approaches lead to the same result (see P:155:227-8). 

(vi) In 1837, Hagen (H'£2), again basing his treatment on 
the assumption that an “error” may be viewed as the sum of a 
large number of infinitesimally small errors, deduced the Normal 
Curve by a method which has been adopted for text-book pre- 
sentation by Merriman (P:^0:17) and Brunt (P:13:ll ) — the 
latter giving also a generalized proof due to Eddington (P:^5:15). 

(vii) In 1850, Sir John F. W. Herschel put forward independently 
a different form of demonstration (H:38), apparently 
without the knowledge that Adrain had employed essentially 
the same method in 1808 (see (iii) here). Supposing “a stone 
to be dropped with the intention that it shall fall upon a given 
mark,” any “deviation from this mark is error” which may be 
expressed by the function $f(r^2)$ or $f(x^2+y^2)$. In his first 
presentation he proceeded by means of an assumption to which Ellis 
objected strongly in a paper (H:27) written primarily as a criticism 
of Herschel’s suggestion, namely, that an observed deviation, 
being equivalent to two deviations parallel respectively to 
the co-ordinate axes, “is a compound event [so that] its probability 
will be the product of their separate probabilities”. Ellis’ 
criticism was that “it is not true that the probability of a compound 
event is the product of those of its constituents unless the 
simple events . . . are independent of one another, and there is 
no shadow of reason for supposing that the occurrence of a 
deviation in one direction is independent of that of a deviation 
in another”. Boole, however (in the Trans. Royal Soc. Edinburgh, 
XXI, 628), attempted to relieve the demonstration of 
this difficulty by asking at the outset that “it be assumed that 
the actual deviation is a compound event”, and that the two 
component deviations “are independent events”; and Herschel 
himself later (in 1857, in a footnote to a reprint of the greater 
portion of his article in J.I.A., XV, 179), took the position that 
“the increase or diminution in one [component] may take place 
without increasing or diminishing the other”, adding that “on 
this, the whole force of the proof turns”. Again in 1870, Crofton 
(H:42) characterized these assumptions as “bold”, and held 
that the method “can hardly be seriously viewed as a demonstration”, 
while Glaisher (H:46) referred to “the unwarrantable 
character of the assumption of equally probable $x$ and $y$ deflections”. 
It has, however, been used in a number of text-books, 
with further explanations as to the import of the assumptions 
made, e.g., in Thomson and Tait’s “Natural Philosophy”, and 
by Brunt (P:13:14), Levy and Roth (P:80:121), and Scarborough 

(viii) Sir George F. Hardy, in P:51:5, has more recently 
suggested that “we may, perhaps, see a logical basis” in the 
supposition that a “deviation” (from the mean) results from an 
infinity of minute superimposed causes, of which the nature is 
unknown although “any one [of them] may produce a minute 
positive or negative deviation from the average . . . [and] we may 
without loss of generality assume them of equal magnitude”. 
Then, “if the number of possible causes of deviation is $2n$, and 
if the extent of each indefinitely small deviation is $k$ ($n$ being 
indefinitely large, but $k\sqrt{n}$ finite), the probability or frequency 
of a total deviation lying between $x$ and $x+k$ will depend on our 
having $\left(n+\frac{x}{2k}\right)$ positive values of $k$ and $\left(n-\frac{x}{2k}\right)$ negative values. 
The probability of this occurring will be represented by the 
appropriate term in the expansion of the binomial $\left(\tfrac{1}{2}+\tfrac{1}{2}\right)^{2n}$, namely 

$$\frac{(2n)!}{\left(n+\frac{x}{2k}\right)!\,\left(n-\frac{x}{2k}\right)!}\left(\tfrac{1}{2}\right)^{2n}.$$” 

This expression, $n$ being indefinitely great, takes [by Stirling’s 
Theorem (p. 151; A; 1)] the form $\frac{1}{(n\pi)^{1/2}}\,e^{-\frac{x^2}{4k^2 n}}$, which is the 
Normal Curve (11) when $c$ is written for $\sqrt{n}$ and $x$ for $\frac{x}{2k}$. 
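
The passage from the binomial term to the exponential form in (viii) can be examined numerically. The sketch below is an illustration only: the names `binomial_term` and `normal_form` are assumptions introduced here, and `m` stands for the excess $\frac{x}{2k}$ of positive over the central number of causes, taken as a whole number.

```python
from math import comb, exp, pi, sqrt

def binomial_term(n: int, m: int) -> float:
    """Exact chance of n + m positive and n - m negative causes out of 2n."""
    return comb(2 * n, n + m) / 4**n

def normal_form(n: int, m: int) -> float:
    """Limiting form reached through Stirling's formula: e^(-m^2/n) / sqrt(n*pi)."""
    return exp(-m * m / n) / sqrt(pi * n)

n = 500  # illustrative value; the argument supposes n indefinitely large
for m in (0, 5, 10, 20, 30):
    exact = binomial_term(n, m)
    approx = normal_form(n, m)
    print(m, exact, approx, approx / exact)
```

For a value of `n` of this size the ratio in the last column stays within about one per cent of unity over the range shown, and it approaches unity more closely as `n` is increased.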


The intellectual struggle which revolved about the validity 
and consequences of the “postulate of the arithmetic mean”, and 
the attempts to establish the universality of the Normal Curve, 
comprise a most illuminating chapter in the history of philosophic 
thought and the gradual evolution of critical analysis. It will 
repay any student well to picture De Moivre — living in London 
as a fugitive from France — enunciating, with then no recognition, 
his remarkable discovery; to realize the extraordinary elegance of 
Laplace’s analytic power, and the stimulus of his great mind 
which shows so clearly in the work of Poisson and the later 
Frenchmen; to follow the careful and systematic Gauss — born 
humbly, like Laplace — and the contributions also of his German 
students and compatriots — Bessel, Encke, and Hagen; to eval- 
uate the interpretations of the English school — Ellis and Glaisher, 
the critical De Morgan, and Todhunter; and to appreciate the 
great practical importance of the sequelae and wider theories 
which in later years have been developed by the Russian, 
Scandinavian, and English writers, indebted still, as they must 
dinavian, and English writers — indebted still, as they must 
always be, to Laplace and Gauss above all others. Much will be 
lost by any student who essays now to grasp the technique of 
this subject without a close examination of its history. The 
philosophic contemplations — even the religious questionings — by 
which its evolution has for so long been accompanied have repre- 
sented largely the soul of man’s search for understanding. The 
material assembled here has accordingly been given in the hope 
that even so mathematical a subject may continue to challenge 
the reader with something of its romance. 
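
As a numerical footnote to Ellis’ objection in (iv) above, the general equation $\Sigma f(x - a) = 0$ can be solved for two different odd functions $f$ to show that they single out different values of $a$. The observations, the function name `solve_location`, and the choice $f(e) = e^3$ are all hypothetical, chosen here only for illustration.

```python
def solve_location(observations, f, tol=1e-12):
    """Find a with sum(f(x - a)) = 0 by bisection.

    Assumes f is odd and increasing, so the sum falls steadily as a rises
    from the smallest observation to the largest.
    """
    lo, hi = min(observations), max(observations)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(f(x - mid) for x in observations) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

observations = [1.0, 2.0, 3.0, 10.0]  # hypothetical data

mean_rule = solve_location(observations, lambda e: e)      # f(e) = e
cube_rule = solve_location(observations, lambda e: e**3)   # f(e) = e^3
print(mean_rule, cube_rule)
```

With $f(e) = e$ the solution is the arithmetic mean of the observations; with $f(e) = e^3$ a different value is obtained, which is the substance of Ellis’ remark that the arithmetic mean is only one of the rules included in the general equation.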

A; 4. Early Attempts to Establish a “Law of Error” 

It may be of interest to note that the varied deductions of 
the Normal Curve by De Moivre, Laplace, Gauss, and others, 
as set out in A; 3, were not accomplished without other sugges- 
tions being made which led to entirely different formulae. 

It must be realized, in the first place, that the problem is, 
fundamentally, that of finding an equation representing the 
actual error resulting in a function, in terms of the component 
errors to which its several independent variables may be subject. 
Mathematically, if a quantity $Q$ (such as a rate of mortality) be 
a function $f(v, w, z, \ldots)$ of several independent variables $v, w, z, \ldots$ 
(such as age, height, domicile, . . . , here assumed to be 
independent), which are subject respectively to component errors 
$\Delta v, \Delta w, \Delta z, \ldots$, then the resulting error, say $\Delta Q$, in $Q$ will be 
given by $f(v+\Delta v, w+\Delta w, z+\Delta z, \ldots) - f(v, w, z, \ldots)$, that is, by 
$\frac{\partial f}{\partial v}\Delta v + \frac{\partial f}{\partial w}\Delta w + \frac{\partial f}{\partial z}\Delta z + \ldots$ (from Taylor’s Theorem) if the 
component errors are sufficiently small that the terms involving 
their squares and the differential coefficients of second and higher 
orders may be neglected. The “law of error” thus emerging, say 
$\varphi(x)$, expressing the relative frequency of the occurrence of 
the resultant error $x$, or $\Delta Q$, in $Q$, will obviously take a form 
which is dependent upon the laws of propagation of the various 
component errors. 
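
The first-order expression for $\Delta Q$ may be compared with the exact change in a simple case. The following is a sketch under stated assumptions: the function `q`, the point at which it is evaluated, and the component errors are all hypothetical, and the partial derivatives are estimated by central differences rather than found analytically.

```python
def first_order_error(f, point, errors, h=1e-6):
    """Sum of (partial derivative) * (component error), from Taylor's Theorem.

    Each partial derivative is estimated by a central difference of width 2h.
    """
    total = 0.0
    for i, err in enumerate(errors):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        partial = (f(*up) - f(*down)) / (2 * h)
        total += partial * err
    return total

def q(v, w, z):
    """Hypothetical relation Q = v * w / z, used only for illustration."""
    return v * w / z

point = (2.0, 3.0, 4.0)           # hypothetical values of v, w, z
errors = (0.001, -0.002, 0.0005)  # hypothetical component errors

exact = q(*(p + e for p, e in zip(point, errors))) - q(*point)
approx = first_order_error(q, point, errors)
print(exact, approx)
```

The two printed values differ only by the neglected terms of second and higher order, which become relatively smaller as the component errors are reduced.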

With this principle clearly in mind, therefore, it is not surprising 
that very early, in 1755 and 1757, Thomas Simpson 
(H:P), assuming simply that errors of different magnitudes are 
equally probable (as in the case of the errors of tabular logarithms, 
etc.), so that $\varphi(x)$ is constant, found in that case a rectangle (see 
P:173:32). He likewise proposed an isosceles triangle when the 
probability of a positive error is $\varphi(x) = -mx + c$, and positive and 
negative errors are equally likely, a case discussed also by 
Lagrange (H:3S:309, and V:165A&1). 


In 1778 Daniel Bernoulli, objecting to the common use of 
the arithmetic mean as assuming that all the observations are of 
equal weight, supposed that small errors are more probable than 
large ones, and proposed $\varphi(x) = +\sqrt{r^2 - x^2}$, where $r$ is a 
constant, thus reaching a semi-circle (H:33:237, and P:173:33). 
Laplace himself brought out and then discarded several others, 
such as $\frac{1}{2a}\log_e\frac{a}{x}$ and $\frac{m}{2}e^{-m|x|}$, where $|x|$ indicates that $x$ is 
always taken positively. This last expression, suggested in 1774, 
is usually referred to as Laplace’s First Law of Error, and results 
from assuming that the median (the middle value in a sequence 
arranged in order of magnitude), rather than the mean, is the 
most probable value of the unknown quantity (see P:155:188 and 
P:^3:27). Examples have been given very recently (P:157) of 
certain data thus distributed, with the suggestion that the 
applicability of the curve may be tested by plotting on “semi-logarithmic” 
paper (sometimes called “arith-log” or “ratio” paper, 
on which a plotting gives $x$ and $\log y$ instead of $x$ and $y$), for then 
the points by this formula lie on a straight line. It has also been 
noted by P. R. Rider (P:109) that this “First Law” of Laplace is 
the case when $m = 1$, and the Normal Curve when $m = 2$, of a 
“generalized law” $\frac{m}{2\,\Gamma(1/m)}\,e^{-|x|^{m}}$, although in practice “it would 
doubtless be sufficiently accurate in most instances to use $m = 1$ 
or $m = 2$”. 
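
The two special cases mentioned by Rider, and the semi-logarithmic test, can be illustrated as follows. This sketch assumes that the generalized law is taken in the normalized form $\frac{m}{2\,\Gamma(1/m)}\,e^{-|x|^{m}}$ with unit scale, which is an assumption made here for illustration; the names `generalized_law` and `area` are likewise hypothetical.

```python
from math import exp, gamma, log, pi, sqrt

def generalized_law(x: float, m: float) -> float:
    """Assumed normalized form m / (2 * Gamma(1/m)) * exp(-|x|^m)."""
    return m / (2 * gamma(1 / m)) * exp(-abs(x) ** m)

def area(m: float, half_width: float = 12.0, steps: int = 24000) -> float:
    """Trapezoidal estimate of the total area between -half_width and +half_width."""
    h = 2 * half_width / steps
    total = 0.5 * (generalized_law(-half_width, m) + generalized_law(half_width, m))
    for i in range(1, steps):
        total += generalized_law(-half_width + i * h, m)
    return total * h

# m = 1 gives the First Law of Laplace, m = 2 the Normal Curve with c = 1.
print(area(1), area(2))
print(generalized_law(0.0, 2), 1 / sqrt(pi))

# On semi-logarithmic paper the m = 1 curve is straight: log y falls by
# equal amounts for equal steps in x.
steps_in_log_y = [log(generalized_law(x + 1, 1)) - log(generalized_law(x, 1))
                  for x in (1.0, 2.0, 3.0)]
print(steps_in_log_y)
```

Both areas should come out very close to unity, and the equal steps in $\log y$ for $m = 1$ correspond to the straight line on “ratio” paper described above.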



Under yet other circumstances, such as where the quantity 
$Q$, or $f(v, w, z, \ldots)$, is known to be determined by some definite 
relation, for example, it is clear that the law of the 

propagation of errors in Q is fixed by the laws governing the 
generation of errors in v and w. It must be realized, therefore, 
as Woodward states succinctly (P:173:31), that “every investigator 
in work of precision should have a clear notion of the 
error-equation [of the type $\Delta Q = \frac{\partial f}{\partial v}\Delta v + \frac{\partial f}{\partial w}\Delta w + \ldots$] appertaining 
to his work, for it is thus only that he can distinguish between the 
important and unimportant sources of error”. It will also be 
evident from these considerations why it is, when the quantity Q 
is related to a very large number of independent variables, instead 
of only two in such a case as Q=kv above, that the resultant 

errors in Q may be visualized as fortuitous — for then they can 
hardly be determined by fixed conditions, and indeed in many 
cases their propagation can only be imagined as the combined 
operation of innumerable small influences of closely equal effect, 
for each of which some reasonable assumption, in the absence of 
knowledge, would appear to be legitimate. This, of course, is 
the hypothesis by which the Normal Curve is approached in 
many of the deductions given in A; 3. The train of thought here 
indicated may also emphasize the importance of the statement 
often made, that the failure of some observed functions — partic- 
ularly in economic and sociological data — to follow the Normal 
Curve is probably due largely to the predominance of relatively 
few influences out of the many which determine the values of 
the function. 


A; 5. Tabulations of the “Probability Integral” or “Error 
Function” 

Because of its great importance in the theory of errors of 
observation, numerical tables of the area of the Probability Integral 
or Error Function, $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$, and of the ordinates, have 
been prepared by many computers. For an account of the various 
tables which have been published since Kramp’s first tabulations 
of 1799, see P:145:58. 

The methods of computation depend on the expansion of $e^{-t^2}$ 
and its integration for small values of $x$, and on an asymptotic 
series when $x$ is large (see P:JfJ^:179, P:iS:19, and F:122:297). 

The tables are to be found in several different forms (see 
P:ff4-14), which must be watched carefully in practical applications. 
One method arises by substituting $c = \sqrt{2npq} = \sigma\sqrt{2}$, so 
that (11) becomes $\frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/(2\sigma^2)}$. Another form frequently 
adopted results from using $\sigma$ as a unit of measure, thus obtaining 
$\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$. Convenient arrangements of extensive values 


have been given in comparatively recent years by W. F. Sheppard 
(P:97 and F:1^9), and by J. W. Glover (P:47:392). Actuaries 
will find short tables of the areas easily accessible in P:i 7^:50 and 
P:51:138, and in most text-books on statistics, while the ordinates 
may be found in P:47:397 and P:16:384. The following specimen 
values of $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt = \operatorname{Erf}(x)$ will give an idea of the areas: 

y/ir Jo 


  x      Erf(x)       x      Erf(x)       x      Erf(x) 

 0.10   0.11246     1.10   0.88021     2.10   0.99702 
 0.20   0.22270     1.20   0.91031     2.20   0.99814 
 0.30   0.32863     1.30   0.93401     2.30   0.99886 
 0.40   0.42839     1.40   0.95229     2.40   0.99931 
 0.50   0.52050     1.50   0.96611     2.50   0.99959 
 0.60   0.60386     1.60   0.97635     2.60   0.99976 
 0.70   0.67780     1.70   0.98379     2.70   0.99987 
 0.80   0.74210     1.80   0.98909     2.80   0.99992 
 0.90   0.79691     1.90   0.99279     2.90   0.99996 
 1.00   0.84270     2.00   0.99532     3.00   0.99998 
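
The specimen values can be reproduced with the error function supplied in Python's standard `math` module, which uses the same definition $\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$. The helper name `erf_rows` and the three-column layout are assumptions introduced here for illustration.

```python
import math

def erf_rows():
    """Rows of (x, Erf(x)) pairs for x = 0.1 to 1.0, 1.1 to 2.0, and 2.1 to 3.0."""
    rows = []
    for i in range(1, 11):
        xs = (i / 10, 1 + i / 10, 2 + i / 10)
        rows.append([(x, round(math.erf(x), 5)) for x in xs])
    return rows

for row in erf_rows():
    print("   ".join(f"{x:4.2f}  {value:.5f}" for x, value in row))
```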


Since it is frequently desirable to have a ready means of 
referring to the “error function” for particular values of the 
limit $x$, the custom has arisen of using Erf $x$ or Erf($x$) to denote 
$\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$. Thus Erf($\infty$) = 1; Erf(0) = 0; formula (16) is 
Erf$\left(\frac{\lambda}{c}\right)$ = .5; the probability that a deviation will lie in the range 
$\pm\sigma$ is, by (11), $\frac{1}{c\sqrt{\pi}}\int_{-\sigma}^{\sigma} e^{-x^2/c^2}\,dx$, which, by putting $\frac{x}{c} = t$, 
becomes $\frac{2}{\sqrt{\pi}}\int_0^{1/\sqrt{2}} e^{-t^2}\,dt = \operatorname{Erf}(1/\sqrt{2}) = .6827$; and so forth. 

Similarly, the probability that a deviation will not exceed $\pm d$ 
(i.e., $d$ in absolute magnitude) is 

$$\frac{2}{\sqrt{\pi}}\int_0^{d/(\sigma\sqrt{2})} e^{-t^2}\,dt = \operatorname{Erf}\left(\frac{d}{\sigma\sqrt{2}}\right);$$ 

consequently, the complementary probability that a deviation 
will not be less than $\pm d$ (i.e., will be as much as, or more than, $d$ 
in absolute magnitude) is $1 - \operatorname{Erf}\left(\frac{d}{\sigma\sqrt{2}}\right)$. 

It is likewise sometimes convenient to use $\operatorname{Erf}^{-1}(x)$ as an 
operator (like $\sin^{-1} x$), to be read as “the number of which the 
error function is $x$”. 
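
These uses of Erf and of its inverse can be sketched in a few lines. The function names `prob_within` and `erf_inverse` are assumptions introduced here; the inverse is found by simple bisection, which is one possible method among several and is valid for arguments between 0 and 1.

```python
import math

def prob_within(d: float, sigma: float) -> float:
    """Probability that a deviation does not exceed d in absolute magnitude."""
    return math.erf(d / (sigma * math.sqrt(2)))

def erf_inverse(p: float, tol: float = 1e-12) -> float:
    """The number of which the error function is p (0 <= p < 1), by bisection."""
    lo, hi = 0.0, 6.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.erf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(prob_within(1.0, 1.0))       # range of plus or minus sigma
print(1 - prob_within(2.0, 1.0))   # as much as, or more than, 2 sigma
print(erf_inverse(0.5))            # argument at which Erf reaches .5
```

The first value agrees with the figure .6827 given above, and the last gives the argument at which the error function equals one-half.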


A; 6. The Nomenclature Associated with the Normal Curve, 
and its History 

Considerable variation, unfortunately, is to be found in the 
English terminology associated with the Normal Curve, and the 
student therefore must be careful to identify the names used in 
any of the classical (and even in some of the modern) works 
which he may peruse. Detailed examinations of the discrepancies, 
and the origins of the terms, are given in P:145:49 and 176. 

Attention may here be drawn to the following matters in 
particular: 

(i) Although the “standard deviation” is the usual name for $\sigma$ 
(the square root of the mean of the squares of the deviations 
(or “errors”) measured from the mean), many Continental 
writers (following Gauss) refer to it as the “mean error”, which 
is certainly not descriptive, and creates confusion with the 
“average or mean error (irrespective of sign)” of formula (13). 

(ii) The “standard deviation” was called by Lexis the 
“dispersion”. 

(iii) The term “variance” for the mean square deviation — 
being the square of the standard deviation — is now used exten- 
sively by R. A. Fisher and his followers. 

(iv) Brunt (P:13:29) adopts a confusing procedure in using 
“mean square error, or M.S.E.” for the standard deviation. 

(v) The “probable error” — unquestionably a misleading 
phrase — is gradually being supplanted, although it is still widely 
used. 

(vi) The parameter c in (11) has been called by some writers 
(notably Airy) the “modulus”. 

(vii) By putting $\frac{1}{c} = h$, (11) becomes $f(x) = \frac{h}{\sqrt{\pi}}\,e^{-h^2x^2}$, in 
which $h$ is often called the “precision” (particularly by astronomers 
and physicists) since the cluster of the values about the 
mean becomes closer as $h$ increases. 

(viii) The “weight” is usually defined, in the classical texts, 
as the reciprocal of the square of the probable error, or, as in 
this study (see (107) and (108) in Chapter VIII) as the reciprocal 
of the square of the modulus $c$, the probable error $\lambda$, or the 
standard deviation $\sigma$. Woolhouse (P:174:46) and G. F. Hardy 
(P:51:118), however, define the weight as the reciprocal merely of the 
probable error (not of its square), a procedure which must be 
remembered carefully in connection with their development of 
the “normal equations” in the Method of Least Squares (see 
p. 323; C; 20). 

In this study the terms adopted are those now most generally 
employed in the literature with which actuaries are mainly 
concerned. 

A; 7. The Lexis Theory 

The publication by Lexis in 1877 of his method of 
analyzing “dispersion” (as he called the standard deviation) 
forms an outstanding landmark in the development of mathematical 
statistics. It is of interest for actuaries to note that his 
theory was devised with particular reference to variations in 
mortality and in the sex ratio at birth. 

Charlier, of Sweden, illustrated the fundamental Lexis theory 
and the “Lexis ratio” by a wide variety of card-drawing experiments 
(see P:S^:137), and developed the “Charlier Coefficient of 
Disturbancy”. 

Since their original statement the ideas of Lexis have undergone 
much elaboration and many refinements, which are now 
embodied in the modern techniques associated with the $\chi^2$ test 
(see Chapter IX) and the “analysis of variance” (see Chapter X). 


A; 8. Bessel’s Formulae 

The use of the factor $\frac{n}{n-1}$ in (42) is generally referred to as 
“Bessel’s Correction”, after the astronomer Bessel (1784-1846), 
who in 1815 originated the term “probable error” and contributed 
largely to the theory of errors and least squares (see P:145:24 and 
186). The precise history of its first derivation has sometimes 
appeared doubtful, and it has been ascribed variously to Bessel, 
Gauss, and Encke (P:^^:116 and 145). It now appears to be 
clear, however, that the method should be credited to Gauss, 
since he set forth the correction in H:17, art. 38 (see P:145:18). 

In the classical presentations the formula is often stated in 
terms of the probable error rather than for $\sigma$ (see, for example, 
P:155:206). The student will also meet there a corresponding 
expression in terms of the deviations without regard to sign, 
which is known as Peters’ formula (see P:155:206, and P:13:38). 

The denominator $(n - 1)$ in reality allows for the loss of one 
“degree of freedom” in thus estimating $\sigma$ from the data. The 
comparable formulae when there are $k$ unknowns (see p. 175; 
A; 18 and p. 250; B; 26) which are treated in the classical discussions 
of the Method of Least Squares will be found similarly 
to employ a denominator $(n - k)$ when there are $k$ “constraints” 
(P:ff0:59, 82, and 86, and P:j?S;18). The principle involved in 
the concept of degrees of freedom is accordingly (as stated at 
p. 176; A; 19) to be credited in the first instance to Gauss. 
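
The effect of the denominator $(n - 1)$ can be seen in a small sampling experiment. The sketch below is illustrative only: the function name `mean_square_deviation`, the sample size, the number of trials, and the random seed are all assumptions made here, and the samples are drawn from a normal population of unit standard deviation.

```python
import random

def mean_square_deviation(sample, lost_degrees=0):
    """Sum of squared deviations from the sample mean, divided by n - lost_degrees."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - lost_degrees)

random.seed(1)        # arbitrary seed, so that the run can be repeated
n, trials = 5, 20000  # hypothetical sample size and number of samples

uncorrected = corrected = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    uncorrected += mean_square_deviation(sample, 0) / trials
    corrected += mean_square_deviation(sample, 1) / trials

# The true mean square deviation of the population is 1.
print(uncorrected, corrected, uncorrected * n / (n - 1))
```

On the average the divisor $n$ gives a value near $\frac{n-1}{n}$ of the true figure, while the divisor $(n - 1)$, that is, multiplication by the factor $\frac{n}{n-1}$, removes this deficiency.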

A; 9. The Bayes-Laplace Theorem 

The “probability of causes” was first examined by an English 
dissenting minister, Thomas Bayes, F.R.S., whose famous paper 
(H:10) on the subject was submitted posthumously to the Royal 
Society in 1763 by Dr. Richard Price (well known to actuaries 
as the author of the Northampton Mortality Table). 

The great importance of Bayes’ contribution, however, has 
often been confused through misstatements of its original form, 
and inadmissible applications. Many text-book treatments, 
moreover, have failed to recognize the facts that Bayes dealt 
actually with the special case when the a priori existence 
probabilities, $\kappa_r$, are equal, and that it was Laplace who gave (H:16) 
the generalized form when the $\kappa_r$’s are not all equal (see H:166:14, 
and H:175:29 and 32). As in the text here, the name “Bayes’ 
Theorem” should therefore be confined to the former case, with 
the term “Bayes-Laplace Theorem” as a preferable designation 
for the latter. 

A complete and very convenient reproduction of Bayes’ 
original argument, which followed a geometric process, is avail- 
able in H :190y and partly also in H :SS :294 and H :166 :48. Excel- 
lent discussions of the misstatements concerning both Bayes’ and 
Laplace’s methods, which have become so prevalent, may be 
found in li:166:lQA6 and P:5^:54. 


A; 10. Helmert’s Distribution of $\sigma'^2$, and “Student’s” Distribution 

“Student’s” original derivation (H:117) of formula (44) was 
effected in 1908 through an empirical process of first finding the 
distribution of $\sigma'^2$ by means of algebraic expressions for the first 
four moments and fitting Pearson’s Type III curve. In so doing 
he was unaware that the distribution of $\sigma'^2$ (given in (2) at p. 225; 
B; 13) had been obtained by Helmert in 1876 (H:50), this fact 
not being generally known until Karl Pearson brought Helmert’s 
prior work to light in 1931 (H:174). 

The first rigorous proof of the distribution of “Student’s” z 
was given by R. A. Fisher in 1926 and is followed on 
p. 226; B; 13 here. 

The importance of “Student’s” work was not immediately 
realized. It has now, however, assumed a position of great 
prominence in the theory of small samples, largely through 
Fisher’s wide extensions. 

A memoir of “Student”, who was an English statistician, 
W. S. Gosset, may be found in H:15P. 


A; 11. Poisson’s Exponential “Law of Small Numbers” 

Although this exponential was discovered by Poisson in 1837, 
it remained for the Russian Bortkiewicz (H:75), in 1898, to draw 
attention to its importance through an illustration (of interest 
to actuaries) based on the mortality in the Prussian army from 
the kicks of horses (see p. 308; C; 14). 

For this reason, apparently, it is sometimes referred to as 
the “Poisson-Bortkiewicz”, or even as the “Bortkiewicz”, func- 
tion. It seems much more proper, however, to credit so impor- 
tant a formula only to its discoverer. 

Some difference of opinion, likewise, is manifest with regard 
to the term “Law of Small Numbers” — a phrase suggested by 
Bortkiewicz; some modern writers have questioned its appro- 
priateness, and have suggested instead the “Law of Small Prob- 
abilities”. It is to be observed, however, that Bortkiewicz was 
undoubtedly right in using the word “numbers” rather than 
“probabilities” — for the function is applicable in particular to 
circumstances in which both q (or p) and nq (or np) are small, 
while it is not necessarily preferable to the Normal Curve when 
only the “probabilities” q (or p) are small without nq (or np) 
being small as well (see p. 267; C; 4). 
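
The circumstances in which both $q$ and $nq$ are small may be 
illustrated by a short calculation. The Python sketch below is an 
illustration only; the function names and the values $n = 1000$ and 
$q = 0.002$ (so that $nq = 2$) are assumptions chosen here, not figures 
from the text. It compares the exact binomial probabilities with those 
of Poisson’s exponential having the same mean. 

```python
import math


def binomial_pmf(k, n, q):
    """Probability of exactly k occurrences in n trials, each of chance q."""
    return math.comb(n, k) * q**k * (1 - q) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson's exponential: probability of exactly k occurrences."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def max_abs_difference(n, q, k_max):
    """Largest absolute gap between the two laws for k = 0, ..., k_max."""
    return max(
        abs(binomial_pmf(k, n, q) - poisson_pmf(k, n * q))
        for k in range(k_max + 1)
    )


# Hypothetical case in which both q and nq are small (nq = 2).
gap = max_abs_difference(n=1000, q=0.002, k_max=10)
```

In this hypothetical case the two sets of probabilities differ by less 
than one part in a thousand at every value of $k$ examined. 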


A; 12. The Generalizations of the “Normal Law” 

The dependence of the normal forms (10) and (11) on a par- 
ticular method of approximation (see p. 203; B; 6), and their 
consequent inability to deal with skew distributions, was realized 
early in the classical literature. The nature of the approximation, 
for example, was pointed out by Laplace (see H:5S:548-652) in a 
form equivalent to the j term of Edgeworth’s Generalized Law 
of Error (see H:107:329, footnote). Poisson, moreover, and many 
years later De Forest independently (H:04), found the j term 
of Edgeworth’s series (see H:5;S:290). 

Gram, of Denmark (P:S0:182-3, and H:5J), however — under 
the inspiration of a Danish actuary, Opperman (see P:S;^:130, 
footnote) — was the first mathematician, in 1879, to represent a 
general system of skew frequency curves by a series of which the 
Normal Curve is a special case (H:50). Subsequently Thiele — 
likewise a Danish mathematician, who is known to actuaries also 
for other contributions (Hr^S, and see li:81 ) — reached Gram’s 
series by the use of his half-invariants (P:S^:183, and H:P4). 
The German astronomer Bruns next gave a comprehensive 
treatment in H :108. More recently Charlier, the Swedish astron- 
omer, has discussed the derivation and application of these series 
representations in much detail, and has so far systematized their 
classification and methods of fitting that two general types are 
now usually named the Gram-Charlier, or Type A, and the 
Poisson-Charlier, or Type B. 

Edgeworth’s important work in England appeared between 
1883 and 1920 in the midst of these Continental researches, and 
was characterized by an unusually wide knowledge of the signi- 
ficant steps which had been taken by De Forest in the United 
States, and by the Scandinavian and German mathematicians. 
His method of approach, which was always more philosophical 
than definitely practical, laid great emphasis upon the meta- 
physical concept of probability, and the necessity of deducing 
from a priori considerations a universal law which would repre- 
sent the frequency distribution of a magnitude arising from a 
number of independently varying elements. For that reason, 
and because his extensive contributions (numbering 74 papers 
altogether) were scattered in many different scientific journals, 
his work has been overshadowed by the less philosophical 
researches of the Scandinavians and the more practical investigations 
of Pearson and his followers. Edgeworth’s writings, nevertheless, 
constitute an outstanding presentation of a viewpoint of 
great interest. The main principles of his work should therefore 
certainly be known, a task which has been assisted notably by 
Bowley’s explanation listed at 


A; 13. The Search for a “Law” of Mortality 

Perhaps to some extent as the result of an over-estimation of 
the importance of the hypothetical “life-table” (cf. P:154*401, and 
P:17f :282), and consequently too little emphasis upon the fact 
that the function of primary significance in the measurement of 
mortality must be the rate of mortality, $q_x$, only (or an analogous 
function such as the force of mortality, $\mu_x$, or the central death 
rate, $m_x$, or colog $p_x$), the classical literature is replete with many 
interesting attempts to reach a formula which might represent 
the values of $l_x$ from infancy to old age, or during a part only of 
the entire span of life. Unquestionably, also, the belief persisted 
for many years that a “law” of mortality, exhibiting itself as the 
ordered progress of the life-table’s theoretical cohort of persons 
marching through time, would ultimately be revealed to some 
lucky or diligent enquirer, a “law” which again would confirm 
the faith of those deeply religious philosophers who sought 
uniformities in Nature as a manifestation of the Divine Will. Nor 
was this as fanciful in those days as it might seem now; for 
scientists were discovering the fundamental “laws” of physics 
and astronomy, and everywhere the destinies of men as well as 
of matter were contemplated as probably resulting from the 
unwavering influences of an Unseen Power. The failure to 
discover so inflexible a “law” for the mortality of the human race 
merely brought gradually to light the unfathomable complexities 
which may at any time determine the mortality experience of an 
isolated group; it could not of course disturb the faith with which 
those earlier investigators conceived of the whole Universe 
proceeding to its destiny upon a “law” of progress. 

This search for a “law” of mortality was, of course, supported 
by many philosophical and analytical discussions of the features 
which such a law might be expected to exhibit. Stimulating 
though they are, it is hardly necessary to give here any summary 
of those speculations. The main references for the assistance of 
any interested reader, however, would be H:4S; H:1S; 
H:74; K:169; 11:66; and 


A; 14. The Verhulst-Pearl-Reed (the “Logistic”) Curve of 
Population Growth 

The influences which may be expected to direct the growth 
of populations have naturally been the subject of speculation 
and enquiry ever since the rudiments of a scientific approach to 
current problems first made their appearance. The original 
edict of Malthus (H:11), to the effect that a population unaffected 
by migration would soon face starvation if it grew in 
geometrical progression while its food supply increased only in 
arithmetical progression, is of course now obvious enough; in 
his day it was sufficiently alarming, and ever since that time it 
has re-appeared constantly in popular discussions of the world’s 
politico-economic problems. The Malthusian principle, however, 
did not lead to any constructive mathematical formulation. That 
was, some years later, the accomplishment of the Belgian 
Verhulst, for in 1838 (H:^S) he reached a curve of type (102) 
with $A = 0$, and in 1845 (H:;^^) named it the “logistic”. This 
important contribution lay dormant, however, until the curve 
was re-discovered independently by Pearl and Reed in 1920 
(H:13S), Verhulst’s prior work not coming to their knowledge 
until 1922 (see F:176:5 and F:96:5Q9). 


A; 15. The Development of Curves for Forecasting Mortality 

For many years it was the custom to base the calculation of 
mortality rates for any experience upon the age $x$ as the sole 
variable, so that the function investigated was $q_x$, the rate of 
mortality in the year of age $x$ to $x+1$. Later the statistics derivable 
from the records of insurance companies showed that where 
“selection” can be exercised with respect to the group of lives 
under observation (whether by the medical examination of applicants 
for insurance, or the self-selection of buyers of annuities) 
the rates of mortality in the year of age $x$ to $x+1$ must be denoted 
as $q_{[x-t]+t}$ and viewed as a function of two variables: the age at 
entry, $[x-t]$, and the years of duration since entry, $t$. During 
about the last fifteen years the realization has been growing that 
a progressive change has been occurring in these rates, and that 
consequently it may often be necessary to take into account a 
third variable: either the calendar year, $z$, in which the rates 
of mortality occurred, or the year of birth of the “generation” 
from which they are derived. 

When rates of mortality are thus analyzed by age and generation, 
and if a change in the rates with time is thus revealed, it 
evidently may become important to “forecast” the rates which 
may be anticipated in the calendar years to come. A number 
of tentative forecasts based on graphic methods or the use of 
straight lines or parabolas have been published by Scandinavian 
and German investigators since Gyldén’s (H:57) first effort in 
1875 (see F:ZS). Knibbs’ advocacy of “fluent” or “projected” 
life tables appeared in 1917 in his remarkably exhaustive work 
published as an Appendix to the reports on the Australian census 
of 1911 (H :/;^^:380). The problem has received serious attention, 
however, only since the compilation of the British Government 
and British Offices’ annuitants’ experiences of 1900-1920 (H:15U 
and H:143; see also H:157). A very useful summary of the 
various formulae which have been employed is to be found in 
P:;?S;165. Reference may also be made to Greenwood’s paper 
P:49. 

A; 16. The History of the Method of Least Squares 

The problem of solving $\nu$ observation equations for unknowns 
numbering less than $\nu$ seems to have engaged the attention of 
astronomers since about 1750. The principle of “least squares” 
as a means of accomplishing this solution appears to have been 
used originally by Gauss in 1795 (H :f04’*576). It was named and 
first published in 1805, however, by the French mathematician 
Legendre, an English translation of his exposition being 
available in H:164:576. 

The appearance of this important method led quickly to an 
enormous literature (see H :53, and V\90 :21Z), particularly in 
Germany and France, where Gauss, Encke, Hagen, Bessel, 
Helmert, Laplace, and others explored its theoretical foundations 
and its applicability to astronomical problems, and examined 
fully its relation to the postulate of the arithmetic mean 
and the “curve of errors” (cf. p. 151; A; 3). The debt which the 
mathematicians of today thus owe to Gauss in particular, however, 
has not always been fully recognized (cf. p. 164; A; 8, and 
p. 176; A; 19); indeed it may be well here to quote Deming’s 
sound opinion that Gauss “accomplished most of what we think 
we know now about least squares” (P:28:1). Later Kummel 
(H:50), whose work has been overlooked until recently, in 1879 
“filled in a good share of what was missing”, as Deming (loc. cit. 
pp. 2, 42, and 47) has also pointed out, and papers by Stewart 
(H:1S4) and Uhler have clarified greatly the questions 
which arise when the independent variable $x$ as well as the 
dependent variable $y$ is subject to error (see F:122:380f and 
P:28). 

A; 17. Early Methods of Fitting Curves, and Methods other 
than those of Least Squares, Moments, and Minimum-$\chi^2$ 

A brief summary of the three following methods, with refer- 
ences, is given in P:1 55:259, 

(i) The method apparently first used by Tobias Mayer as 
early as 1748, by which the equations of condition are merely 
summed in sets, was later used extensively. 

(ii) A method of “minimum approximation”, the object of 
which is to determine the smallest value which can be assigned 
to the absolute value of the greatest discrepancy between the 
data and the fitted function, was solved by a laborious process 
by Laplace in 1799, named by Goedseels in 1907, and solved 
more easily by C. J. de la Vallée-Poussin in 1911 (see also 
P:131:14). 

(iii) Edgeworth’s method, which was suggested in 1887, 
would minimize the sum of the absolute values of the residuals, 
and thus is based on “Laplace’s First Law of Error” (cf. p. 159; 
A; 4), i.e., on the hypothesis that the error function is of the form 
$y =$ … , where $|x|$ is taken positively in both directions, so 
that the median is the most probable value. The solution for 
the case of two unknowns is explained and illustrated in 
103-109; when there are more than two unknowns, however, it 
is difficult to minimize the sum of the absolute values of the 
residuals, because, when all the terms are to be taken as positive, 
it is not known until the work is completed which are naturally 
positive or negative (see H:S7:49-50). 
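
The simplest case of method (iii), that of a single unknown, may be 
sketched as follows. The Python fragment is an illustration only; the 
function name and the five hypothetical observations are assumptions 
made here. It shows that the sum of the absolute values of the 
residuals is smaller when they are measured from the median than when 
they are measured from the arithmetic mean. 

```python
import statistics


def sum_abs_residuals(values, centre):
    """Sum of the absolute deviations of the values from a chosen centre."""
    return sum(abs(v - centre) for v in values)


# Hypothetical observations, one of which is widely discordant.
observations = [1.0, 2.0, 3.0, 4.0, 100.0]

median = statistics.median(observations)   # 3.0
mean = statistics.fmean(observations)      # 22.0

at_median = sum_abs_residuals(observations, median)
at_mean = sum_abs_residuals(observations, mean)
```

For several unknowns no such direct rule is available, which is the 
difficulty noted in the text. 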


Another proposal has been advanced more recently by H. S. 
Will (H:170), and is called by him descriptively a “method of 
mean difference functions”, or more shortly a “method of 
differences”. Arguing that the criterion of least squares is not 
entirely satisfying in respect of “data in which the errors of 
observation are small in comparison with the analytic deviations 
from trend”, and hence discarding the least squares process of 
minimizing the sum of the squares of the residuals, he remarks 
that his method gives a sum of the absolute values of the residuals 
less than those resulting from least squares or moments, although 
it does not rigorously satisfy Edgeworth’s desideratum that the 
sum of the absolute values should be a minimum. The procedure 
(for which simplicity of computation and satisfactory 
practical results are claimed) is to develop a series of expressions 
for each parameter by differencing, and thence to take the mean 
value. For example, in the fitting of a straight line, where $\alpha$ 
and $\beta$ are to be determined from a set of $\nu$ observation equations 
$f_x = \alpha + \beta x$ for $x = 1, 2, \ldots, \nu$, the increment $k\,\Delta x$ in $x$ (where $k$ is 
settled from a suggested rule) would give an increment in $f'_x$, say 
$\Delta_k f'_x$, of $\beta k\,\Delta x$; hence 

$$\beta = \frac{\Delta_k f'_x}{k\,\Delta x};$$

proceeding thus for each of the $\nu - k$ such increments, the $\nu - k$ 
values are obtained for $\beta$, and the value sought is to be taken from 
the mean of all those values as 

$$\beta = \frac{\sum \Delta_k f'_x}{(\nu - k)\,k\,\Delta x};$$

and $\alpha$ thereafter would be found similarly. In the 
paper referred to (H:170) the expressions by this method are 
also given for fitting polynomials, hyperbolic, logarithmic, and 
exponential series, and the logistic curve. The process, however, 
is of course arbitrary, even though in some cases it may give 
reasonable practical results. 
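
The straight-line case just described may be sketched in Python. This 
is a tentative illustration, not Will’s own computation: the function 
name `will_straight_line` is hypothetical, the observations are assumed 
to be equally spaced, the value of `k` is simply supplied by the user 
(the “suggested rule” is not reproduced here), and the intercept is 
taken as the mean of the values of $f'_x - \beta x$, which is only one 
reading of the statement that it “would be found similarly”. 

```python
def will_straight_line(xs, fs, k):
    """Fit f = alpha + beta * x by averaging differences taken k steps apart.

    xs -- equally spaced abscissae (assumption of this sketch)
    fs -- observed values at those abscissae
    k  -- number of steps spanned by each difference
    """
    n = len(xs)
    if n != len(fs) or not 0 < k < n:
        raise ValueError("need len(xs) == len(fs) and 0 < k < len(xs)")
    dx = xs[1] - xs[0]
    # The n - k increments of the observed values over an interval k * dx.
    increments = [fs[i + k] - fs[i] for i in range(n - k)]
    # Mean of the n - k separate values of beta.
    beta = sum(increments) / ((n - k) * k * dx)
    # Assumed treatment of alpha: mean of the values f - beta * x.
    alpha = sum(f - beta * x for x, f in zip(xs, fs)) / n
    return alpha, beta


# Hypothetical data lying exactly on the line f = 2 + 3x.
xs = [1, 2, 3, 4, 5, 6]
fs = [2 + 3 * x for x in xs]
alpha, beta = will_straight_line(xs, fs, k=2)
```

When the data lie exactly on a straight line the procedure reproduces 
its constants; with irregular data the result depends on the choice of 
`k`, which reflects the arbitrary element noted in the text. 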

Will’s method is closely analogous to a procedure which has 
in fact been used for many years. Woolhouse, for example 
(H:40:403), in determining the constants of Makeham’s formula 
(83) in the logarithmic form $\log l_x = \log k + x \log s + c^x \log g$, 
differenced twice over interval $t$, whence (cf. P:5P:92) immediately 

$$c^t = \frac{\Delta^2 \log l_{x+t}}{\Delta^2 \log l_x};$$

proceeding similarly for several values of $x$ he 
then took the mean of the resulting values of $\log c$, and based the 
final values for the other unknowns on mean values also. 
Woolhouse’s method was extended later by King and Hardy (H:62: 
200; see also P:5S:94, and H:^5:81 for a complete numerical 
illustration) into one based on the use of four sums (instead of 
single values) of $\log l_x$, from $a$ to $a+t-1$, $a+t$ to $a+2t-1$, $a+2t$ 
to $a+3t-1$, and $a+3t$ to $a+4t-1$, by which the subsequent 
construction of the mean values was avoided. 
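
Woolhouse’s procedure may be sketched in Python. The sketch is an 
illustration under stated assumptions: the function names, the use of 
common logarithms, and the Makeham constants of the example are all 
hypothetical and are not taken from the text. Because the terms 
$\log k$ and $x \log s$ disappear on differencing twice, the ratio of 
two second differences taken $t$ years apart is $c^t$. 

```python
import math


def makeham_log_l(x, log_k, log_s, c, log_g):
    """log l_x under Makeham's formula in its logarithmic form."""
    return log_k + x * log_s + (c**x) * log_g


def second_difference(log_l, x, t):
    """Second difference, over interval t, of a table {age: log l_x}."""
    return log_l[x + 2 * t] - 2 * log_l[x + t] + log_l[x]


def woolhouse_log_c(log_l, ages, t):
    """Mean of the separate values of log c found at the given ages."""
    estimates = []
    for x in ages:
        ratio = second_difference(log_l, x + t, t) / second_difference(log_l, x, t)
        estimates.append(math.log10(ratio) / t)   # ratio equals c ** t
    return sum(estimates) / len(estimates)


# Hypothetical Makeham constants, used only to build an exact test table.
table = {
    x: makeham_log_l(x, log_k=5.0, log_s=-0.003, c=1.1, log_g=-0.0005)
    for x in range(20, 61)
}
log_c = woolhouse_log_c(table, ages=[20, 25, 30], t=10)
```

With a table built exactly from Makeham’s formula every separate value 
agrees; with observed data the values differ, and it is their mean 
which Woolhouse adopted. 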

This employment of sums over consecutive portions of the 
data has also been used in Cantelli’s “method of areas” (H:^P), 
except that he bases his procedure on integrals instead of finite 
sums (see 11:1 OU and HiH 5:444). 

Two graphical devices of some interest which have been put 
forward may also be included in this record of curve-fitting pro- 
cesses, since they both employ principles beyond the mere draw- 
ing of a smooth curve through the unadjusted data. 

The first is Calderon’s invention of a mechanical contrivance 
for obtaining log c in Makeham’s formula graphically (see H:7^: 
173), although the proposal is now only of historical significance. 

The other, however, is more immediately practicable, since 
it concerns the prescribing of limits, in accordance with the 
theory of errors, within which the graphically graduated values 
of observed data should lie. Calderon again (in H:75:170) 
seems to have been the first in actuarial literature to entertain 
the idea, which was well described by G. F. Hardy in the following 
words (loc. cit., 193): “The probable error curve, by 
means of which Mr. Calderon represented the unadjusted values 
as a sort of river instead of a single line, was a useful suggestion 
for graphically graduating a series of observations. The great 
danger in graphic graduation was that they had not constantly 
before them in different parts of the curve any guide as to the 
extent to which they were justified in departing from the original 
facts in drawing their graduated curve. This method provided 
them with such a guide, in the form of limiting curves, between 
which, on the whole, the graduated curve should lie.” In the 
graduation of the observed $q'_x$, for example, we have, as at p. 274; 
C; 7, that $\sigma\{q'_x\} = \sqrt{p_x q_x / E_x}$; and in Chapter III it is shown 
that, under the conditions of a normal distribution (which here 
means that $E_x q_x$ should not be less than about 10) about 95% of the 
area will be included between $\pm 2\sigma$ (or about 99.7% between $\pm 3\sigma$). 
Furthermore, if $E_x$ is large enough that the observed $p'_x$ and $q'_x$ 
can be substituted as estimates for $p_x$ and $q_x$ (cf. p. 291; C; 10), 
it follows that the upper and lower limits 

$$q'_x \pm 2\sqrt{\frac{p'_x q'_x}{E_x}}$$

will embrace about 95% of the values of $q_x$ which will be obtained by 
observation. On this principle Orloff (P:96) has suggested plotting 
on the graph a vertical bar extending between these limits 
for each observed $q'_x$, and then performing the graphic graduation 
by drawing the curve smoothly, but so that it will cut all the bars, 
or as many of them as practicable. [Calderon’s proposal, although 
reached by a less easy analysis, was made for a graduation of 
$m_x$, using in effect $\pm\sigma$ (embracing 68%) instead of $\pm 2\sigma$ as limits, 
and thus taking … , which were set up on the 
graph (see H:79:172) as being approximately $\pm$ … , since 
… and $\sqrt{1 - m_x} = 1$ at most ages.] The process follows 
the ideas inherent in the “confidence” or “fiducial” limits 
mentioned in Chapter X. 
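
The limits of twice the standard deviation on either side of an 
observed rate may be computed as in the following Python sketch. It is 
an illustration only: the function name `graduation_limits` and the 
figures used (an exposure of 10,000 with an observed rate of 0.01, 
giving 100 expected deaths) are assumptions made here and do not come 
from the text. 

```python
import math


def graduation_limits(q_observed, exposed, multiple=2.0):
    """Lower and upper limits q' -/+ multiple * sqrt(p' q' / E).

    q_observed -- observed rate of mortality q'
    exposed    -- number exposed to risk, E
    multiple   -- number of standard deviations (2 embraces about 95%)
    """
    if exposed <= 0 or not 0.0 <= q_observed <= 1.0:
        raise ValueError("need exposed > 0 and 0 <= q_observed <= 1")
    p_observed = 1.0 - q_observed
    sigma = math.sqrt(p_observed * q_observed / exposed)
    return q_observed - multiple * sigma, q_observed + multiple * sigma


# Hypothetical observation: E = 10,000 and q' = 0.01, so E * q' = 100.
lower, upper = graduation_limits(q_observed=0.01, exposed=10000)
```

A vertical bar drawn between `lower` and `upper` at each age then gives 
the guide through which the graduated curve is to be drawn. 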

A; 18. The Mean Square Error of an Observation, when there 
are $\nu$ Observation Equations and $k$ Unknowns 

The expression 

$$\frac{\sum_{r=1}^{\nu} w_r \,(f'_r - f_r)^2}{\nu - k}$$

derived at p. 250; B; 26, 
which in the classical notation (see p. 324; C; 21) is usually 
written 

$$\frac{[pvv]}{\nu - k},$$

where $p$ denotes the weight $w_r$ and $v$ the “residual” 
$(f'_r - f_r)$, was deduced originally by Gauss in H:17: arts. 
37-39 (cf. p. 164; A; 8, and the references there given). Being 
thus firmly established as part of the formal procedure in deter- 
mining the mean square error of an observation when the number 
of unknowns is k, it is to be found throughout the text-books on 
the method of least squares, and is there applied frequently in 
examining the agreement which has been obtained between a 
fitted curve and the observed data. It is so used systematic- 
ally by Merriman, for example, in P:p{?:134, 137, and 197, and 
by many later writers (see P:f;^4’145, 153, and 168). 
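
The calculation may be sketched in Python as follows. The function 
name and the four hypothetical observations are assumptions introduced 
here for illustration; only the form of the expression, a weighted sum 
of squared residuals divided by the number of observation equations 
less the number of unknowns, is taken from the text. 

```python
def mean_square_error(weights, observed, fitted, unknowns):
    """Weighted sum of squared residuals divided by (nu - k).

    weights  -- weight of each observation
    observed -- observed values
    fitted   -- values given by the fitted curve
    unknowns -- number k of constants determined from the data
    """
    nu = len(observed)
    if len(weights) != nu or len(fitted) != nu:
        raise ValueError("weights, observed and fitted must have equal length")
    if nu <= unknowns:
        raise ValueError("observation equations must outnumber the unknowns")
    weighted_sum = sum(
        w * (o - f) ** 2 for w, o, f in zip(weights, observed, fitted)
    )
    return weighted_sum / (nu - unknowns)


# Hypothetical example: four observations of equal weight, two unknowns.
mse = mean_square_error(
    weights=[1.0, 1.0, 1.0, 1.0],
    observed=[1.0, 2.0, 3.0, 5.0],
    fitted=[1.0, 2.0, 4.0, 4.0],
    unknowns=2,
)
```

Dividing by the full number of observations instead would understate 
the error, since two of the four residuals are in effect absorbed in 
fixing the two constants. 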

In actuarial literature the formula was employed first, in the 
form (126), by Thiele in 1871 (H 45:321), as a test of the good- 
ness of fit of a mortality table graduation. The only writer — at 
least in English publications — who subsequently noticed Thiele’s 
procedure seems to be De Forest. His exhaustive series of 
papers began to appear in 1871 (see F:166); in 1873 he referred 
to Thiele’s methods (H:45:334), and in H:45:14 and H:54’6 he 
discussed the conditions under which (126) should be applied. 
Recently attention has again been drawn to Thiele’s work by 
Seal in P:1^5:6. 


A; 19. The $\chi^2$ Test for Goodness of Fit, and “Degrees of 
Freedom” 

The mathematical expression for the distribution of $\chi^2$, in a 
form equivalent to (127), was given originally by Pizzetti in 1892 
(H:71:267). The $\chi^2$ test, however, was first developed for practical 
use by Karl Pearson in 1900 (H:50), and constitutes undoubtedly 
one of the most important of his many contributions 
to Mathematical Statistics. In reality, as pointed out in Chapter 
V, it completed the theory underlying the Lexis method. 

For some years the formula was applied incorrectly in certain 
cases, since it was not then realized that the number of variables, 
$\nu$, must be diminished by 1 for each linear “constraint”. Certain 
discrepancies in the results of the test had been noted by Brownlee 
(H:…) and Greenwood and Yule (H:…); the necessity of allowing 
for the “degrees of freedom”, $d$, was finally established clearly 
by R. A. Fisher (H:1S8) in 1922. Many of the statistical applications 
of the $\chi^2$ test prior to that date are therefore erroneous. 
It should be noted, moreover, that the correct principle has 
always been recognized as the accepted procedure which emerges 
in the analogous Method of Least Squares, and is to be credited 
in the first instance to Gauss (see ¥\123\ and p. 164; A ;8). 

Tables of the integral $P$ corresponding to values of $\chi_0^2$ and 
$d + 1$ $(= \nu)$ were first computed by Elderton, to 6 places, and, 
after publication in “Biometrika”, were reprinted in P:P7. 
More recently, R. A. Fisher (in P:I0) has given a table of $\chi_0^2$ to 
3 places corresponding to selected values of the integral to 2 
places only, which in some respects is more convenient since 2 
places are ordinarily sufficient. The values are there shown for 
values of $d$ from 1 to 30 (with the useful suggestion that for large 
values of $d$ a close enough approximation is afforded by assuming 
that $\sqrt{2\chi^2}$ is “normally” distributed about a mean $\sqrt{2d - 1}$ with 
unit standard deviation). For most practical purposes, indeed, 
sufficient accuracy is attained simply by reading $P$ from a diagram 
(see P:I77:418, 422, and 640). 
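
The approximation suggested for large values of $d$ may be sketched in 
Python. The sketch is illustrative only: the function name is 
hypothetical, $P$ is taken here to be the probability that $\chi^2$ 
exceeds $\chi_0^2$, and $d = 30$ is used merely because it is the 
largest value for which a tabulated figure is quoted in this section 
(the suggestion itself is meant for larger values). 

```python
from statistics import NormalDist


def chi_square_value_large_d(p_exceed, d):
    """Approximate chi-square value exceeded with probability p_exceed.

    Uses the assumption that sqrt(2 * chi^2) is normally distributed
    about a mean sqrt(2d - 1) with unit standard deviation.
    """
    if not 0.0 < p_exceed < 1.0 or d < 1:
        raise ValueError("need 0 < p_exceed < 1 and d >= 1")
    z = NormalDist().inv_cdf(1.0 - p_exceed)   # normal deviate for the tail
    return (z + (2.0 * d - 1.0) ** 0.5) ** 2 / 2.0


approx_p50 = chi_square_value_large_d(0.50, 30)
approx_p05 = chi_square_value_large_d(0.05, 30)
```

For $d = 30$ and $P = .05$ the result lies within about 0.3 of the 
tabulated 43.773. 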

The range is indicated by the following specimen values: 

Degrees of                  Value of $\chi_0^2$ 
Freedom, d    P=.99      P=.95      P=.50      P=.05      P=.01 

     1       .000157     .004       .455      3.841      6.635 
    10       2.558      3.940      9.342     18.307     23.209 
    20       8.260     10.851     19.337     31.410     37.566 
    30      14.953     18.493     29.336     43.773     50.892 


B; 1. The Nature of “Probability” 

It has been pointed out by many writers (see, for example, 
P:5^:3-10, 54-55, and 146-149, and P:W:i5 et seq.) that the 
theory of “probability” can be viewed as (i) a mathematical 
theory of arrangements; (ii) a study of the actual statistical 
frequencies of observed occurrences; or (iii) a branch of logic. 
It will be convenient to speak of these three approaches as the 
mathematical, statistical, and logical respectively, using these 
terms, of course, merely for purposes of identification, since the 
three methods are not in reality so sharply separable as this 
abbreviated nomenclature might at first seem to imply. 

In the purely “mathematical” sense, a “true” value for the 
probability is either known, or assumed to be known, a priori, 
and the problems then concern the “arrangements” which may 
arise (or any set of consistent hypotheses may be assumed, and 
the consequent propositions are to be deduced therefrom). The 
“statistical” viewpoint considers primarily the actually observed 
occurrences, when the true probability is not known a priori but 
is, instead, to be “estimated” a posteriori from the statistical 
frequencies observed. The consideration of probability as a 
branch of “logic” involves questions which are partly psycho- 
logical rather than wholly logical. 

It will thus be at once apparent that we are dealing with a 
subject which comprises the pure mathematics of the theorist, 
the statistical observations of the experimentalist, and also — 
inseparably interlocked with both — the contemplations of the 
philosopher. The variations of approach are thus so wide that 
it has even been suggested, with more than a little justification, 
that any attempt to set forth “the theory of probability” is largely 
an attempt at “a description of a state of mind”. 

Even though the student need not ordinarily concern himself 
too deeply with these often fine and sometimes trivial distinc- 
tions, nevertheless he should be thoroughly aware that the pos- 
sible variations in that “state of mind”, as they may influence 
the fundamental ideas essential to an understanding of the nature 
of probability, have of course been discussed at great length in 
many languages, and are to be found conveniently in English in 
such presentations as Balfour’s elegant “Defence of Philosophic 
Doubt: An Essay on the Foundations of Belief” (H:55), and in 
Keynes’ “Treatise on Probability” (P:75). A reading of those 
speculations is perhaps more likely to confirm than to resolve the 
uncertainty as to what may constitute a “best” approach; any 
attempt to explore thoroughly the meanings of “rational belief”, 
“inference”, the principle of “insufficient reason” (Boole’s “equal 
distribution of ignorance” and Keynes’ “principle of indiffer- 
ence”), and the opposite principle of “cogent reason” (being an 
“unequal distribution of ignorance”), must even impose caution 
upon the very use, for example in a context such as this, of the 
words “perhaps”, “likely”, “uncertainty”, and “best”. Positive 
assertions are indeed discouraged by such contemplations, and 
we are brought inevitably to seek an approach to the applications 
of probability theory which will recognize the practical demands 
of any system of “statistical or scientific inference”, while yet 
conforming with the obvious theoretical requirements. 

For the theory of probability is intended really to provide a 
basis of measurement for scientific inference. That different 
people draw different inferences regarding a proposition from the 
same set of data is one of the recognized characteristics of a prac- 
tical world. It illustrates, in every-day life, the relationship 
between the inference which the logical situation would demand, 
the measure of the probability concerning the inference derivable 
a priori from the data or a posteriori from the observations, and 
the effect of the person’s state of mind. Jeffreys (although in a 
somewhat different connection) has put the matter well: “One 
person, reading the proof of Euclid’s fifth proposition, is com- 
pletely convinced; another is entirely unable to grasp it; while 
there is at any rate one case on record when a student said that 
the author had rendered the result highly probable” (P:^^:10). 

It may therefore be advisable at this stage to classify very 
briefly the schools of thought which have arisen from these three 
main interrelated approaches, namely, (i) the purely mathematical; 
(ii) that based on the observed statistical frequencies; 
and (iii) that in which the principles of academic logic form the 
starting-point. 

(i) Mathematical 

This was the classical approach, and was developed principally 
as a mathematical theory of arrangements. The “true” a priori 
probabilities either are clearly known, as in direct problems concerning 
dice, coins, cards, urns, etc., or are to be assumed in some 
identifiable form. 

The notion of probability seems to have been mentioned first 
in China by Sun-Tze about 200 B.C. in connection with the 
probability that a birth would be that of a boy or girl. No really 
serious attention, however, was given to problems in probability 
until the fifteenth century in Europe, and then the discussions 
concerned mainly games of chance and gambling. After certain 
preliminary references by Pacioli, Cardan, Galileo, and Kepler, 
interest became greatly intensified by a series of questions propounded 
by a gambler, the Chevalier de Méré, to Pascal, amongst 
which was the famous “Problem of Points” concerning the equitable 
division of the stakes of two gamblers who discontinue the 
game prior to its completion. The fundamental letters which 
passed between the Frenchmen Pascal and Fermat in 1654 (see 
P:f *6), and the first printed work on the subject by the Dutchman 
Huygens in 1657 (H:1), led to the great “Ars Conjectandi” 
of James Bernoulli in Switzerland in 1713, and thereafter to the 
contributions of the French Montmort (H:5), De Moivre, 
Legendre, and Laplace, and the German Gauss. 

The fact that some of the modern analyses of these classical 
dissertations have placed certain of their premises and conclu- 
sions in better perspective does not alter the opinion that the 
mathematical foundations thus laid must, to a very large degree, 
form the basis of any reasoned discussion of the theory of prob- 
ability and mathematical statistics. It is true, of course, that 
much sharp criticism has been directed against Laplace and others 
in connection with their use of the “Principle of Insufficient 
Reason”. That principle, which is perhaps more clearly described 
by Boole’s phrase “the equal distribution of ignorance”, 
asserts that the unknown a priori probabilities must be equal 
when our state of ignorance precludes the assignment of unequal 
values; i.e., to quote Jeffreys (F:68:20), “if we have no means of 
choosing between alternatives, the probabilities attached to those 
alternatives are equal”; or, to quote Keynes (P:75:372), “when 
the probability of an event is unknown, we may suppose all possible 
values of the probability between 0 and 1 to be equally 
likely a priori”. This “principle”, however, will be found upon 
examination to be in reality an arbitrary assumption; and, unless 
it is accompanied by a careful analysis and statement of the 
circumstances in which it is applied, it may lead (as indeed it 
led Laplace and others in the case of the so-called “Rule of 
Succession”) to some strange and paradoxical results. Most of 
the criticism, nevertheless, has been undiscriminating, and much 
of it unmerited and far too bitter.* These difficulties, however, 
clearly arise from that one special arbitrary assumption, and 
should not be taken as justification for discarding many of the 
fundamental mathematical concepts or for belittling (as is some- 
times done today) the monumental contributions of the Laplac- 
ian school. 

Within recent years some attention has been attracted by the
efforts of von Mises, which have been examined in English
particularly by Copeland (H:iS^), to establish the ideas
of probability upon an “axiomatic” basis, from which the entire
theory would be deducible by strictly mathematical deductions.
According to this method, the “mathematical” and “statistical”
approaches are, in effect, related through a definition of
probability based on the notion of “sequences”. If there is a set, or
“Kollektiv”, of n objects, then the “probability” of an object
with a specified characteristic which occurs m times in the first n
will be the limit of m/n when n is increased indefinitely; but von
Mises requires that two conditions must be satisfied: (a) the limit
just mentioned must exist, and (b) a “principle of irregularity (or
disorder)” must also exist such that the limit remains unaltered in
any sub-sequence — for example, if the sequence H, T, T, H, H, 
H, T, . . . is the heads and tails of throws of a coin, and in the 
limit H appears in one-half of the throws, then also in any sub- 
sequence such as T, H, H, H, T, . . . , selected from any position 
in the sequence, and forming all or only some of the terms of the 
sequence, H must likewise appear in one-half of the throws. 
Great mathematical and logical difficulties, however, have been
encountered in the attempts to establish by this means the
fundamental theorems of mathematical probability, while as a
practical method a major obstacle, arising from the definition of a
“Kollektiv” restricted by condition (b), must be that the set of
objects cannot apparently be identified as a “Kollektiv” until n
has been increased indefinitely (see P:W:30; V:SO\ and V:156:2).

*The controversy mainly involves the theory of “inverse probability,”
and consequently, for the reasons already stated in the footnote on
p. 8, need not be pursued in this present study.
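As a purely illustrative sketch of the two conditions just described, the following Python fragment simulates throws of a fair coin and compares the relative frequency of heads in the whole sequence with that in a sub-sequence chosen by position alone. The pseudo-random generator, the seed, and the names `relative_frequency` and `every_third` are assumptions made here for illustration and form no part of von Mises' treatment; a finite simulation can suggest, but cannot establish, the existence of the limit.

```python
import random


def relative_frequency(throws):
    """Return m/n: the proportion of heads ('H') among the throws."""
    return sum(1 for t in throws if t == "H") / len(throws)


def every_third(throws):
    """A place selection: keep the 1st, 4th, 7th, ... throws.

    The choice depends only on position, never on the outcome.
    """
    return throws[::3]


# Hypothetical experiment: 100,000 simulated throws of a fair coin.
rng = random.Random(0)  # fixed seed so that the run is repeatable
throws = [rng.choice("HT") for _ in range(100_000)]

for n in (10, 100, 1_000, 10_000, 100_000):
    whole = relative_frequency(throws[:n])
    part = relative_frequency(every_third(throws[:n]))
    print(f"n={n:>6}  m/n={whole:.4f}  sub-sequence={part:.4f}")
```

For small n the two frequencies may differ appreciably from one-half and from each other; it is only their behaviour as n grows that bears on conditions (a) and (b).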

Another strictly mathematical treatment, based on certain
axiomatic foundations, has been given by Kolmogoroff
and recently has been developed in an English presentation by
Cramér. In the latter will be found also many of the
modern purely mathematical contributions of Borel, de la Vallée
Poussin, Lévy, Fréchet, Cantelli, Khintchine, Liapounoff,
Markoff, Romanovsky, Radon, and others. In view of the
essentially practical requirements of this volume, however, it 
will be sufficient here to give the preceding references for any 
reader who may wish to pursue the complexities of these formal 
mathematical investigations. 

(ii) Statistical

Keynes has pointed out (P:75:92) that Ellis (H:24) was
apparently the first to state the importance of considering
probabilities on the basis of observed frequencies. It was Venn,
however, who elaborated that viewpoint and really
established the course of reasoning which has led to its wide
acceptance by the modern English school. Venn's original and 
involved presentation, of course, has undergone much critical 
examination, and in its details would hardly now prove satis- 
factory to many of his followers. 

The discussions again have involved attacks upon the “principle
of insufficient reason”; for, as Ellis expressed the matter,
“mere ignorance is no ground for any inference whatever; ex
nihilo nihil”. But since this position, which is equivalent to
saying (as previously noted) that the “principle” is merely an
arbitrary assumption, must obviously lead in many cases simply
to a complete inability to reach a solution, attempts have been
made to formulate conditions under which the “principle of
insufficient reason” can be invoked. Von Kries, for example, in
his discussion of the opposite “principle of cogent reason” (H:67),
elaborates the view that “the arrangement of the equally likely
cases must have a cogent reason and not be subject to arbitrary
conditions” (see P:5^:7, and P:75:42 and 87). In this connection
Arne Fisher (P:S6:9) suggests that, since “a rigorous application
of the principle of cogent reason seems impossible”, a compromise
between that principle and the principle of “insufficient reason”
may be effected “by the following definition of equally possible
cases: ‘Equally possible cases are such cases in which we, after
an exhaustive analysis of the physical laws underlying the
structure of the complex of causes influencing the special event, are
led to assume that no particular case will occur in preference to
any other’.” Keynes also (P:75:41), in a long discussion of the
subject, during which he seeks to clear up the difficulties and
formulate the “principle of insufficient reason” in a more precise
and workable form, observes that it is not a “sufficient” condition,
and reaches the conclusion that the difficulties have arisen
“when the alternatives, which the principle . . . treated as
equivalent, actually contain or might contain a different or an
indefinite number of more elementary units”; that is to say, the
principle can be utilized only so long as it is recognized that it
“is not applicable to a pair of alternatives if we know that either
of them is capable of being further split up into a pair of possible
but incompatible alternatives of the same form as the original
pair”.

Faced with this questionable applicability of the “principle
of insufficient reason”, and recognizing at the same time the
difficulties of formulating and using any correction of it in the
form of a “principle of cogent reason”, the modern supporters of
the statistical frequency approach have therefore taken the
position that “it should be possible to draw valid conclusions from
the data alone, and without a priori assumptions”; they
“disclaim knowledge a priori, or prefer to avoid introducing such
knowledge as we possess into the basis of an exact mathematical
argument” (ibid.), so that it may be possible to construct a
theory of statistical inference without the use of a priori
probability (P:e7).

The possibility of thus assembling a consistent body of doctrine
based solely upon observed statistical frequencies, without
recourse to the concept of a true a priori probability, seems first 
to have been examined rigorously by T. N. Thiele of Denmark 
(H i94) in 1884. But in so doing, of course (just as in postulating 
the existence, in the mathematical method, of the a priori prob- 
abilities which are now to be avoided), care must be taken in 
formulating the approach. It is obviously quite insufficient 
merely to suggest that the observed statistical frequency in “a 
large number of trials” is to be taken as the “probability” (as has 
been done so often in algebraical text-books and elsewhere) — for 
the question, “What is a large number?” again at once asserts 
itself. It is therefore clearly essential to adopt for the concept of 
“probability”, as Thiele suggests, “the limiting value of the 
relative [statistical] frequency of an event, when the number of 
observations amongst which the event happens approaches in- 
finity as a limit”. It will be noted here that, in effect, the em- 
phasis is laid upon the statistical frequency — the concept of 
“probability” not emerging until the ultimate hypothetical state 
of an infinite number of observations is reached — whereas in the 
mathematical approach the true “probability”, as an a priori 
concept, is supposed to be given, and we find ourselves at once 
faced with the question as to how closely, and under what condi- 
tions, it can be measured from the statistical frequency observed 
in a stated number of trials (see p. 263; C; 1). In neither the 
mathematical nor in the statistical approach, therefore, can we 
avoid, in reality, the essential problem of establishing the con- 
nection between the true “a priori probability” and the “statis- 
tical frequency” observed. 

The modern developments in the theory of “sampling” (see 
Chapter V) which have been founded on this viewpoint have, of 
course, a special interest for actuaries and vital statisticians on 
account of their evident plausibility and their practical nature. 
The mathematical and logical bases of the method, however, are 
still undergoing critical examination and discussion, particularly 
from those whose mental processes lead them to prefer a rigor- 
ously mathematical or logical, instead of a plainly empirical, 
approach. (See, for example, Keynes' comments, P:75:92-110,
and the interesting controversy and misunderstandings between
Jeffreys and R. A. Fisher, P:^7, and — partially summarized
in P:ff4-138-142, and eventually resolved as stated in P:^Sa:323.)

(iii) Logical

The first critic to assail the philosophical position of the
mathematicians of Bernoulli’s time was Hume (H:8). To quote
Keynes (P:75:272): “Argument by induction — inference from
past particulars to future generalisations — was the real object of
his attack. Hume showed, not that inductive methods were
false, but that their validity had never been established and that
all possible lines of proof seemed equally unpromising”. D’Alembert,
also — partly known for his discussions of the famous “St.
Petersburg Paradox” (see H:5^:258; P:5^:51; and P:75:316)
originally considered by Daniel Bernoulli — is remembered
(notwithstanding many demonstrated errors in his reasoning) for the
scepticism which he likewise expressed.

Within recent years Keynes (P:75) — acknowledging his 
indebtedness to Leibniz, “who in the dissertation, written in his
twenty-third year, on the mode of electing the kings of Poland, 
[first] conceived of Probability as a branch of Logic” — has dealt 
at great length with the description of probability as “comprising 
that part of logic which deals with arguments which are rational 
but not conclusive”. Following the English tradition of Locke, 
Berkeley, Hume, Mill, and later Bertrand Russell — “who, in 
spite of their divergencies of doctrine, are united in a preference 
for what is matter of fact, and have conceived their subject as a 
branch rather of science than of the creative imagination” — 
Keynes sets out, from the fundamentals of Logic, to develop, 
first, “the characteristics and the justification of probable Know- 
ledge”; second, the deduction, by the methods and symbolism 
of Formal Logic, of the usual theorems for the addition, multi- 
plication, etc., of probabilities; third, the methods of Induction 
and Analogy; and fourth, the foundations of “Statistical Infer- 
ence”. Except so far as mathematics are essential in some parts 
to illustrate his criticisms of other methods, Keynes accordingly 
bases his entire approach upon the principles of Logic. The fact 
that probability when so treated remains essentially non-measur- 
able must, under these circumstances, be dealt with by a variety 



Mathematics and Interpretations 


187 


of terminological refinements. The stimulating discussions to 
which this method leads, however, are of course important, and 
may be found in the works of Ramsey (P:^04) and Jeffreys
(P:68 and 68a), as well as in Keynes’ treatise (see also P:55).

B; 2. Bernoulli’s Limit Theorem 

While the statement on p. 8 conveys the essential meaning
of Bernoulli’s famous Limit Theorem (the Law of Large
Numbers) in a convenient verbal form, it is not exactly what
Bernoulli said. It is to be noted especially, therefore, that
Bernoulli’s Theorem in more precise terms was this:

If in a set or series of $n$ trials of an event, in each of which
trials the “true” a priori probability of success is a constant $p$,
the actual number of successes is observed to be $s$ (so that the
“statistical frequency” of success observed is $s/n$), then the
probability, say $P$, that the discrepancy $|s/n - p|$ between the
observed statistical frequency, $s/n$, and the true a priori
probability, $p$, is less than a previously assigned quantity (say
$\epsilon > 0$), approaches 1 or certainty (that is to say, will be
greater than $1 - \eta$, where $\eta$ is a previously assigned positive
quantity) as the number of trials, $n$, is increased indefinitely.

Or, symbolically, given two positive numbers $\epsilon$ and $\eta$, the
probability $P$ of the inequality $|s/n - p| < \epsilon$ will be
greater than $1 - \eta$ if $n$ exceeds a certain limit.

The interpretation ordinarily placed on these precise statements
is that the observed statistical frequency, $s/n$, tends to
coincide with the theoretical or true probability, $p$, as the number
of trials, $n$, is increased indefinitely. Symbolically,
$\lim_{n \to \infty} s/n = p$.

In other words, by postulating the existence of the true
probability, $p$, even if we are unable to determine its value a priori
(as can be done in the conventional urn and coin-tossing
experiments, but under other circumstances frequently cannot be
done), it thus becomes possible to examine the degree of
approximation afforded by an observed statistical frequency, $s/n$. The
great importance of Bernoulli’s discovery therefore lay in the
facilities which it afforded for the examination of mass
phenomena, through the observation of statistical frequencies when
the numerical value of the a priori probability was in fact
unknown.
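The behaviour described by the theorem can be checked numerically in a simple case. The sketch below sums the exact binomial terms to obtain the probability $P$ that $|s/n - p| < \epsilon$ for several values of $n$. The function name `prob_within` and the values $p = 1/2$ and $\epsilon = 1/20$ are chosen here purely for illustration.

```python
from fractions import Fraction
from math import exp, lgamma, log


def prob_within(n, p, eps):
    """Probability P that |s/n - p| < eps in n trials with constant p.

    p and eps are Fractions so that the inequality is tested exactly;
    the binomial terms are evaluated through logarithms to avoid
    underflow when n is large.
    """
    log_p, log_q = log(p), log(1 - p)
    total = 0.0
    for s in range(n + 1):
        if abs(Fraction(s, n) - p) < eps:
            log_term = (lgamma(n + 1) - lgamma(s + 1) - lgamma(n - s + 1)
                        + s * log_p + (n - s) * log_q)
            total += exp(log_term)
    return total


p, eps = Fraction(1, 2), Fraction(1, 20)
for n in (10, 100, 1000):
    print(n, round(prob_within(n, p, eps), 4))
```

With these assumed values the computed probability rises towards 1 as $n$ is increased, so that for any assigned $\eta$ a sufficiently large $n$ gives $P > 1 - \eta$.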

Bernoulli’s original proof of his theorem — upon which he 
states that he spent upwards of twenty years — is to be found in 
modernized notation in P:^.^^:96, together with a translation of 
his explanations, which are important as clear evidence of his 
intention to investigate the degree of approximation afforded by 
observed statistical frequencies based upon a limited number of 
trials. The theorem was also established later by the investi- 
gations of De Moivre and Laplace, as indicated herein and set 
out in detail in P:f>^5:119, and may be derived easily from the 
Bienaymé-Tchebycheff criterion (see p. 218; B; 10).

It may be of interest to note, even at this point, that a good
deal of attention has been given recently to certain aspects of
the proof and conditions of Bernoulli’s Theorem and the Law of
Large Numbers. It will be seen that the theorem states that
there exists a number, $n$, such that, for any single instance of a
number of trials greater than $n$, the probability of a discrepancy
(between $s/n$ and $p$) less than a given amount ($\epsilon$) will
exceed a given quantity ($1 - \eta$). In 1916, therefore, Cantelli's
Theorem raised the question whether there exists a number, $n'$, such
that for all numbers of trials greater than $n'$, the probabilities of
all the infinity of simultaneous discrepancies (between $s/n$ and $p$,
where $n$ takes all values greater than $n'$) less than the given
amount ($\epsilon$) will still exceed the given quantity ($1 - \eta$).
The mathematical investigation of this important extension of the
original Bernoullian Law of Large Numbers gave an affirmative answer,
which was later perfected by Kolmogoroff and Glivenko. It has become
known as the Uniform or Strong Law of Large Numbers, and, as
will be seen, shows that the probability that the statistical
frequencies, $s/n$, will differ from $p$ by less than $\epsilon$ in the
$n'$th and all the following trials is greater than $1 - \eta$
(cf. P:55:717-8, and P:f4^:101-3).

Much work has also been done on the problem of establishing
the limits and inequalities involved, particularly by Khintchine,
Lévy, Glivenko, Kolmogoroff, and De Finetti. It will be
sufficient here to refer to a valuable review of these investigations by
Fieller (P:55:721), and to Uspensky’s observations in P:f4^*204,
in both of which the inequalities are expressed in a form which
has been called The Law of the Repeated Logarithm. The numerous
original papers dealing with these researches are listed
conveniently in P:55:766-8.


B; 3. Diagrammatic Representations

An understanding of the nature of the problems of practical 
statistics will be assisted materially by visualizing the types of 
diagrams, distributions, curves, and surfaces represented by the 
various statistical tables or mathematical formulae. A sum- 
marized description is therefore given here, and is referred to 
throughout the text. 

The collation, into a statistical table or diagram, of a series 
of observations taken under essentially uniform conditions, but 
in which one or more characteristics are subject to variation, will 
give a picture of the observed distribution with respect to the one 
or more independent variables or variates (or “characteristics”)
$x_1, x_2, \ldots$.

(i) For example, the actual numbers of deaths, $y$, occurring
in a particular community (and under essentially uniform
circumstances) on each exact birthday would give an observed
distribution of deaths with respect to one variable, exact
birthday $x_1$; the numbers of deaths similarly observed on each exact
birthday for persons of each exact height would give an observed
distribution of deaths with respect to two variables, exact
birthday $x_1$ and exact height $x_2$; and so on for any number of such
variables. If for one variable, $x_1$, such values be plotted on
squared paper with $x$ and $y$ axes, they will be represented simply
by a series of points, because the variable $x_1$ clearly can assume
only integral (“discrete”) values (since only the deaths at exact
integral ages are included in the observations); and if the points
be joined together by straight lines (although in this instance of
a series of integral values such straight lines have no
interpretative significance except as a first approximation to hypothetical
intermediate, i.e., non-integral, values) the resulting outline will
be a series of two-dimensional jagged peaks if the data are very
irregular (Figure A1), or of jagged humps (Figure A2).

For two variables, $x_1$ and $x_2$, we may, with $x_1$, $x_2$, and $y$ axes,
picture the same process by erecting verticals at each point on
the base (the $x_1 x_2$ plane); and if again the tops of the verticals
be joined by straight lines, the picture obtained is composed of
three-dimensional jagged (saw-toothed) peaks (Figure B1).

(ii) In many observational procedures, however, it will clearly
be either impracticable or inadvisable (because of the extent to
which the observed data are in fact reduced thereby) to observe,
as in (i), the values of $y$ only for integral values of $x_1, x_2, \ldots$. In
the examples cited, for instance, the natural process would be to
include all the data, whether at integral or fractional values of
$x_1, x_2, \ldots$, in order to make use of the observed deaths not only
at the exact ages, heights, etc., but also at all intermediate ages,
heights, etc. Since, however, it would be unduly laborious to
attempt the tabulations for every such intermediate value of the
variables $x_1, x_2, \ldots$, we are at once brought to the idea of
grouping. The obvious procedure in the case of the distribution
by age alone would be to classify together all those aged $x$ last
birthday, which means grouping the observed deaths in groups
by single years of age ($a$ to $a+1$ exact, $a+1$ to $a+2$ exact, etc.),
although a broader grouping (as for each span of 5 ages, $a$ to $a+5$,
$a+5$ to $a+10$, etc.) could often be used. The groupings $a$ to $a+1$,
etc., or $a$ to $a+5$, etc., are spoken of as the class intervals.

In now plotting such data we can either (α) assume (sometimes)
the middle points $x+\tfrac{1}{2}$ (or $x+2\tfrac{1}{2}$, as the case may be) and
erect the ordinates thereon, or (β) employ the age-group (1 or
5) as the bases for a series of rectangles. Joining by straight
lines the points of (α) we obtain again a series of two-dimensional
jagged peaks or humps, or a frequency polygon (the lines in Figure
A3); (β), on the other hand, gives a diagram of vertical columns
representing areas, or a histogram (the rectangles in Figure A3).

For two variables, similarly, the same processes lead to
(α) again an Alp-like series of three-dimensional jagged peaks
(Figure B1), or (β) a series of three-dimensional square pillars,
like a city built exclusively of square pillars, with bases of
uniform area but of various heights (Figure B2).
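The process of grouping by class intervals for a single variable, and the two ways of plotting the result, may be sketched as follows. The ages, the interval of 5 years, and the function names `group_by_class_interval` and `midpoints` are hypothetical and are introduced here only for illustration.

```python
from collections import Counter


def group_by_class_interval(ages, width):
    """Count observations in class intervals a to a+width (a included)."""
    counts = Counter(int(age // width) * width for age in ages)
    return dict(sorted(counts.items()))


def midpoints(groups, width):
    """Middle point of each class interval, paired with its count."""
    return [(lower + width / 2, count) for lower, count in groups.items()]


# Hypothetical exact ages at death (in years), for illustration only.
ages = [60.2, 61.7, 62.1, 63.9, 64.5, 65.0, 66.3, 67.8, 68.4, 71.2, 72.9]

groups = group_by_class_interval(ages, 5)   # bases of the histogram
polygon = midpoints(groups, 5)              # points of the frequency polygon
print(groups)
print(polygon)
```

Here `groups` supplies the rectangles of the histogram, each standing on its class interval, while `polygon` supplies the ordinates erected at the middle points, which would be joined by straight lines to give the frequency polygon.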

(iii) If now, instead of the preceding discontinuous
classifications by discrete values or groups, we suppose that the
variations are continuous, then the two-dimensional jagged peaks
would obviously become smoothed out into a curve, undulating
(Figure A4) or not (Figure A5) according to the nature of the
data, while the three-dimensional Alps or the pillar-like city
would likewise be transformed into a solid mountain with curved
slopes (Figure B3, in which the curves are symmetrical). The
former is an observed frequency curve; the latter is an observed
frequency surface.

If the observed curve, or the observed surface, can be
postulated a priori, or described a posteriori, by a mathematical
formula, the observed distributions will thereby be replaceable by
a theoretical, or by a “fitted”, frequency curve $y = f(x_1)$ or surface
$y = f(x_1, x_2)$.


With due allowances in the parameters of the expressions
adopted, these functions can of course be made to represent the
distribution either of the numbers of cases or of their relative
frequencies. The curve relating to the relative frequencies is
generally referred to as a probability curve. For such a probability
curve, say $y = \varphi(x)$, the term frequency function or probability
density is given to $\varphi(x)$; the numbers of cases would be expressed
by $y = N\varphi(x)$ where $N$ is the total number of cases; the number of
cases in the interval between $x = a$ and $x = b$ would be
$N \int_a^b \varphi(x)\,dx$; the number between $x$ and $x + dx$ will be
given by $N\varphi(x)\,dx$; and if the whole distribution is included
within $x = h$ and $x = k$, then $\int_h^k \varphi(x)\,dx = 1$, and
$\varphi(x) \geq 0$ for all such values of $x$, while if the distribution
extends indefinitely $\int_{-\infty}^{+\infty} \varphi(x)\,dx = 1$ and
$\varphi(x) \geq 0$ for all values of $x$, since the total number of
cases must equal $N$, and a probability, of course, cannot be negative.
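These properties of a frequency function may be verified numerically. In the sketch below the particular curve `phi` (a normal curve of unit spread), the total `N = 1000`, and the trapezoidal routine `integrate` are all assumptions chosen for illustration; any non-negative function whose total area is 1 would serve equally well.

```python
from math import exp, pi, sqrt


def phi(x):
    """A hypothetical frequency function: the normal curve with unit spread."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def integrate(f, a, b, steps=20_000):
    """Approximate the integral of f from a to b by the trapezoidal rule."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


N = 1000                                   # assumed total number of cases
total = integrate(phi, -10, 10)            # should be very close to 1
cases = N * integrate(phi, -1, 1)          # cases between x = -1 and x = 1
print(round(total, 6), round(cases, 2))
```

The first figure checks that the whole area under the assumed curve is 1, and the second gives the number of cases, out of `N`, falling in the interval between $x = -1$ and $x = 1$.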

If the ordinates of a frequency curve are added successively
from its commencement to form a series of “integrated
frequencies” (thus giving, in the continuous case when the curve
starts, for example, at $-\infty$, the function
$\int_{-\infty}^{t} \varphi(x)\,dx$ for successive values of $t$), the
result will be a function (never decreasing) varying between 0 and 1,
and representing the probability for a value of $x$ not exceeding $t$
(like the smooth shoulder of a mountain in
the form of Figure A6). It is sometimes called the Distribution
Function of Probability, and the curve is generally identified as a
Cumulative Probability Curve. If the number of cases $N$ is
introduced, the curve would, of course, be referred to as a Cumulative
Frequency Curve. In the discontinuous case of a series of
observed values, we obtain clearly a series of progressively rising
points, as in Figure A7, while when groups are used they become
a stairway with steps of varying height as in Figure A8.
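For grouped data the successive addition just described amounts to a running total. The following sketch uses hypothetical class intervals and counts, introduced only for illustration, to form both the cumulative frequencies and the corresponding cumulative proportions.

```python
from itertools import accumulate

# Hypothetical numbers of cases in successive class intervals.
lower_limits = [60, 65, 70, 75, 80]
counts = [5, 12, 20, 9, 4]

N = sum(counts)
cumulative_counts = list(accumulate(counts))           # cumulative frequency
cumulative_probs = [c / N for c in cumulative_counts]  # rises towards 1

for lower, c, prob in zip(lower_limits, cumulative_counts, cumulative_probs):
    print(f"below {lower + 5}: {c:>3} cases, proportion {prob:.2f}")
```

The list `cumulative_probs` never decreases and ends at 1, which corresponds to the stairway of the grouped case.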

[Although the method is not of special importance for the
purposes of this study, it may be well to note that the
“cumulative frequency” principle may be exhibited graphically by the
application of Galton’s Method of Percentiles. If, after a
cumulative frequency diagram has been constructed in the form of a
smooth curve as in Figure A6, the terminal ordinate is divided
in half, and a horizontal line then drawn to cut the curve at M
(Figure A9), the abscissa of M is the median, being, in fact, the
value of the variable in the corresponding non-cumulative curve
of Figure A6 such that its ordinate divides the area into two
equal portions. The three quartiles likewise mark the quarters,
as shown; the nine deciles show the results of proceeding
similarly from the points which divide the terminal ordinate
into ten portions; and the percentiles mark the percentage
divisions, so that, for example, the 5th decile, 2nd quartile, and 50th
percentile all coincide with the median. If we now take an axis
marked off in ten equal parts, and erect ordinates corresponding
to the deciles, the curve becomes of the type shown in Figure
A10, generally known as Galton’s Ogive.]
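The coincidence of the 5th decile, 2nd quartile, and 50th percentile with the median can be illustrated arithmetically. The sketch below relies on the interpolation rule of Python's `statistics.quantiles`, which is a numerical stand-in and not Galton's graphical construction; the observed values are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical observed values of the variable (e.g. heights in cm).
values = [150, 152, 155, 158, 160, 163, 165, 168, 170, 174, 180]

quartiles = quantiles(values, n=4)      # 3 cut points
deciles = quantiles(values, n=10)       # 9 cut points
percentiles = quantiles(values, n=100)  # 99 cut points

print(median(values), quartiles[1], deciles[4], percentiles[49])
```

The four printed figures agree for this set of values, as the text leads one to expect.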

The concept of distribution has been illustrated in the
preceding paragraphs for the cases of one or two independent
variables, which can be represented diagrammatically by the
direct use of our every-day knowledge of space of not more than
three dimensions. The explicit functional relationship $y = f(x_1)$
between the dependent variable $y$ and the single independent
variable $x_1$ is pictured by a graph in “two dimensions” with two
co-ordinate axes $x_1$ and $y$, and a point is fixed by its co-ordinates
$(x_1, y)$. It is convenient to think of this as an implicit function
relating the two variables $x_1$ and $y$ in the form $F(x_1, y) = 0$, so that
the case is one of two variables being represented by a graph in
two dimensions. For the explicit function $y = f(x_1, x_2)$ it was
necessary to use three-dimensional space, with three co-ordinate
axes $x_1$, $x_2$, and $y$; here we have an implicit function of the form
$F(x_1, x_2, y) = 0$ relating three variables (one being a function of the
other two) and requiring three-dimensional space for its depiction.

In dealing with more than three variables, as is often
necessary, there is little difficulty in extending the mathematical
principles; actual diagrammatic representation of more than
three dimensions, however, is not possible. It is convenient,
nevertheless, to continue the use of certain geometrical terms;
thus a particular set of the $\nu$ variables $x_1, x_2, \ldots, y$ is still called
a “point” in $\nu$-dimensional space, with $\nu$ co-ordinates referred to
$\nu$ mutually perpendicular axes; relations between the variables
are hyper-surfaces; and in such $\nu$-dimensional space the $\nu - 1$
independent variables will represent a $(\nu - 1)$-dimensional surface.
As the student will realize from the preceding explanations, these
concepts of hyperspace can usually be dealt with most clearly by
founding the course of reasoning upon analogy from the
corresponding concepts in the 1, 2, or 3 dimensions with which we are
visually familiar (cf. the method of proof used in B; 13).

In this connection one device of considerable value is the use
of contours (or “contour lines”). If we are dealing with an explicit
function $y = f(x_1, x_2)$ of three variables $x_1$, $x_2$, and $y$, which
represents a surface, and we take all the points in the $x_1 x_2$-plane for
which $y = f(x_1, x_2)$ has a constant value, $k$, then these points will
lie on the contour line for the constant value, $k$, of the function.
Alternatively, a given plane cutting a surface will have points
lying on a curve designated a “plane section” of that surface; a
“horizontal section” by a plane $y = k$ parallel to the $x_1 x_2$-plane,
when projected perpendicularly on to the $x_1 x_2$-plane, will
therefore give the “contour”. The contour thus indicates simply the
variation of $x_1$ and $x_2$ for the given value, $k$, of $y$, and shows the
shape of the surface. For example, if the parabola $y = x^2$ is
rotated about the $y$ axis, a “paraboloid” results, which is
represented by $y = x^2 + z^2$ in Figure C1. Giving $k$ the values 1, 2, 3, ...,
the contours of this paraboloid are obtained at heights
$y = 1, 2, 3, \ldots$; and when projected onto the $xz$-plane as in Figure C2
they are shown to be concentric circles with the centre at the origin,
from which the steepness or flatness of the surface is indicated
by the closeness or otherwise of the contours. A change in the
independent variables has the effect of moving a point $(x, z)$ in
the $xz$-plane. Consequently, if an actual path on the surface is
projected onto the $xz$-plane, it will appear as a line or a curve
across the system of contours, and will thus show how $y$ changes
with changes in $x$ and $z$; in fact, $y$ increases as the point moves
from lower to higher contours, and vice versa.

Vertical sections can be used similarly. The contour system,
moreover, can be extended to functions of three independent
variables; the contour lines then become “level surfaces”
$f(x_1, x_2, x_3) = k$; for the function $y = x^2 + z^2 + v^2$, for example, they
are concentric spheres about the origin.
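The contours of the paraboloid $y = x^2 + z^2$ lend themselves to a short numerical sketch. The function names `height` and `contour_radius`, and the choice of eight points round the circle, are illustrative assumptions only.

```python
from math import cos, pi, sin, sqrt


def height(x, z):
    """The paraboloid y = x^2 + z^2 discussed in the text."""
    return x * x + z * z


def contour_radius(k):
    """Radius of the circle on which the paraboloid has height k."""
    return sqrt(k)


radii = [contour_radius(k) for k in (1, 2, 3, 4)]
gaps = [b - a for a, b in zip(radii, radii[1:])]
print([round(r, 3) for r in radii])
print([round(g, 3) for g in gaps])   # gaps shrink as the surface steepens

# Every point on the contour for k = 3 has the same height.
r = contour_radius(3)
points = [(r * cos(2 * pi * i / 8), r * sin(2 * pi * i / 8)) for i in range(8)]
heights = [height(x, z) for x, z in points]
```

The contours for equal steps of $k$ lie closer together as the distance from the origin increases, which is the numerical counterpart of the increasing steepness of the surface.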

The student who may wish to pursue these matters will find 
excellent discussions in P:7^:317, P:4:270, and F:S1:I, 460. 



[Figures A2 and A4]

B; 4. The Mean Square Deviation of the Point Binomial

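The expression (6) is clearly

$$n^2p^2\left(q^n + nq^{n-1}p + \ldots + p^n\right) - 2np\left(nq^{n-1}p + \ldots + np^n\right) + \left(nq^{n-1}p + 2^2\,\frac{n(n-1)}{2!}\,q^{n-2}p^2 + \ldots + n^2p^n\right).$$

The first of these brackets is $n^2p^2(q+p)^n = n^2p^2$. The second,
from (4), is $-2np(np)$. The third may be written

$$\begin{aligned}
& np\left[q^{n-1} + (n-1)\,2\,p\,q^{n-2} + \frac{(n-1)(n-2)}{2!}\,3\,p^2q^{n-3} + \ldots\right] \\
&= np\left[q^{n-1} + (n-1)(1+1)\,p\,q^{n-2} + \frac{(n-1)(n-2)}{2!}\,(1+2)\,p^2q^{n-3} + \ldots\right] \\
&= np\left[q^{n-1} + (n-1)\,p\,q^{n-2} + \frac{(n-1)(n-2)}{2!}\,p^2q^{n-3} + \ldots\right] \\
&\qquad + np(n-1)p\left[q^{n-2} + (n-2)\,p\,q^{n-3} + \ldots\right] \\
&= np(q+p)^{n-1} + n(n-1)p^2(q+p)^{n-2} \\
&= np[1 + (n-1)p] = np(np+q).
\end{aligned}$$

The three brackets together therefore give

$$n^2p^2 - 2np(np) + np(np+q) = npq.$$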
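The result may be confirmed by direct summation for any particular $n$ and $p$. The sketch below uses exact rational arithmetic; the values $n = 12$ and $p = 1/3$, and the function names, are chosen here only for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_terms(n, p):
    """Terms of (q + p)^n: probability of exactly t successes, t = 0..n."""
    q = 1 - p
    return [comb(n, t) * q ** (n - t) * p ** t for t in range(n + 1)]


def mean_square_deviation(n, p):
    """Sum of (t - np)^2 times the probability of t successes."""
    terms = binomial_terms(n, p)
    return sum((t - n * p) ** 2 * term for t, term in enumerate(terms))


n, p = 12, Fraction(1, 3)      # illustrative values only
q = 1 - p
terms = binomial_terms(n, p)
first = n**2 * p**2 * sum(terms)
second = -2 * n * p * sum(t * term for t, term in enumerate(terms))
third = sum(t**2 * term for t, term in enumerate(terms))
print(first + second + third, n * p * q)
```

The three quantities `first`, `second`, and `third` correspond to the three brackets of the proof, and their sum is compared with $npq$.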

A simple alternative proof (given in P:50:141, and based on
Laplace’s original method of differentiation set out in P:50:104)
is the following:

The expression (6) can at once be written (omitting the limits
$t = 0$ and $t = n$ for convenience)

$$\sum {}^{n}C_{t}\, q^{n-t} p^{t}\, t^2 - 2np \sum {}^{n}C_{t}\, q^{n-t} p^{t}\, t + n^2p^2 \sum {}^{n}C_{t}\, q^{n-t} p^{t} \qquad \text{(a)}$$

Now $(q+p)^n = \sum {}^{n}C_{t}\, p^{t} q^{n-t}$. Differentiating this identity with
respect to $p$, we have $n(q+p)^{n-1} = \sum {}^{n}C_{t}\, t\, p^{t-1} q^{n-t}$, whence

$$np(q+p)^{n-1} = \sum {}^{n}C_{t}\, t\, p^{t} q^{n-t} \qquad \text{(b)}$$

And now differentiating (b) with respect to $p$,

$$n(q+p)^{n-1} + n(n-1)p(q+p)^{n-2} = \sum {}^{n}C_{t}\, t^2 p^{t-1} q^{n-t} \qquad \text{(c)}$$

Substituting from (b) and (c) in (a), and using $q + p = 1$, we get
(a) as $np + n(n-1)p^2 - 2n^2p^2 + n^2p^2 = np - np^2 = npq$.

The same type of proof can be used in the other cases also.


It may be useful at this stage to observe, in the notation of
moments as stated at p. 254; B; 27, that the relationships
established in the text, and above, are

$\mu'_1 = np$ ....(4)

and $\mu'_2 = np(np+q)$ ....(5)

or $\mu_2 = npq$ ....(6)-(7)

By the same methods of proof we may also obtain easily
$\mu'_3 = n^3p^3 + 3n^2p^2q + npq(q-p)$
or $\mu_3 = npq(q-p)$

and $\mu'_4 = n^4p^4 + 6n^3p^3q - n^2p^2q(4p-7q) + npq(1-6pq)$
or $\mu_4 = npq(1-6pq) + 3n^2p^2q^2$

$= npq[3(n-2)pq+1]$.
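The expressions for the third and fourth moments can be checked in the same direct manner. In this sketch the function `moments`, and the values $n = 9$ and $p = 1/4$, are assumptions made purely for illustration; exact fractions are used so that the comparisons are not disturbed by rounding.

```python
from fractions import Fraction
from math import comb


def moments(n, p, order):
    """Return (moment about the origin, moment about the mean np)."""
    q = 1 - p
    probs = [comb(n, t) * q ** (n - t) * p ** t for t in range(n + 1)]
    about_origin = sum(t ** order * pr for t, pr in enumerate(probs))
    about_mean = sum((t - n * p) ** order * pr for t, pr in enumerate(probs))
    return about_origin, about_mean


n, p = 9, Fraction(1, 4)       # illustrative values only
q = 1 - p
raw3, central3 = moments(n, p, 3)
raw4, central4 = moments(n, p, 4)

print(central3 == n * p * q * (q - p))
print(central4 == n * p * q * (3 * (n - 2) * p * q + 1))
```

Interchanging the roles of $p$ and $q$ in the call reverses the sign of the third moment about the mean, which is the point of the caution regarding the expansion $(p+q)^n$.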

Other methods of derivation are given in P:5^:109-110 and
:107-110. In the proofs shown in the last reference, and
in some others to be found in the text-books, these moments of
the binomial are given for the expansion $(p+q)^n$, instead of
$(q+p)^n$ as used here, with the result that $p$ and $q$ are interchanged
in the results. The student should accordingly note, for
example, that in P:Ji:110, Hardy thus states $\mu_3 = npq(p-q)$ for the
binomial $(p+q)^n$, in contrast with $\mu_3 = npq(q-p)$ for $(q+p)^n$
shown in P:5P:110, and in this study.


B; 5. Derivation of the Normal Law of Deviations, and Skew- 
Normal Forms 

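Substituting (9) as $n^{n+\frac{1}{2}} \sqrt{2\pi}\, e^{-n}$ for $n!$ in the factorials of (2),
the latter becomes

$$y_x = \left(1 + \frac{x}{np}\right)^{-np-x-\frac{1}{2}} \left(1 - \frac{x}{nq}\right)^{-nq+x-\frac{1}{2}} \cdot \frac{1}{(2\pi npq)^{\frac{1}{2}}}.$$

Taking the natural logarithm, and expanding, we then obtain

$$\log\left[y_x (2\pi npq)^{\frac{1}{2}}\right] = -\left(np + x + \tfrac{1}{2}\right)\left[\frac{x}{np} - \frac{x^2}{2n^2p^2} + \phi_1(x)\right] - \left(nq - x + \tfrac{1}{2}\right)\left[-\frac{x}{nq} - \frac{x^2}{2n^2q^2} + \phi_2(x)\right],$$

where $\phi_1(x)$ and $\phi_2(x)$, the remainder terms, are convergent series.

Now if it be assumed that $n$ is so large that the terms involving
in their denominators powers of $n$ above the first may be
neglected, this becomes

$$\frac{x}{2n}\left(\frac{1}{q} - \frac{1}{p}\right) - \frac{x^2}{2n}\left(\frac{1}{p} + \frac{1}{q}\right) = \frac{x(p-q)}{2npq} - \frac{x^2}{2npq}.$$

Hence

$$y_x = \frac{1}{\sqrt{2\pi npq}}\; e^{-\frac{x^2}{2npq}}\; e^{\frac{x(p-q)}{2npq}} \qquad \text{(i)}$$

In this expression the exponent $\frac{x(p-q)}{2npq}$ differs but little from
zero when $n$ is so large that $\frac{x}{np}$ and $\frac{x}{nq}$ may be neglected, and
becomes zero when $p = q = \frac{1}{2}$. On those conditions of
approximation, however, $\frac{x^2}{2npq}$ will not be negligible. We therefore
reach finally

$$y_x = \frac{1}{\sqrt{2\pi npq}}\; e^{-\frac{x^2}{2npq}} \qquad (10)$$

as a representation of the probability of a deviation of $+x$ under
the particular conditions assumed.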

Numerical illustrations are given at p. 266; C; 4. 
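As a supplement to those illustrations, the following minimal Python sketch (not part of the original text; the function names are illustrative assumptions) sets the exact ordinate of the point binomial beside the symmetrical form (10) and the skew form (i), the latter taken as (10) multiplied by $e^{x(p-q)/2npq}$.

```python
from math import comb, exp, pi, sqrt


def binomial_ordinate(n, p, x):
    """Exact probability of a deviation x, i.e. of np + x occurrences.

    np + x is assumed to be a whole number.
    """
    t = round(n * p + x)
    return comb(n, t) * p**t * (1 - p) ** (n - t)


def normal_ordinate(n, p, x):
    """Formula (10): exp(-x^2 / 2npq) / sqrt(2 pi npq)."""
    v = n * p * (1 - p)
    return exp(-x * x / (2 * v)) / sqrt(2 * pi * v)


def skew_normal_ordinate(n, p, x):
    """Formula (i): formula (10) times the skew factor exp(x(p - q) / 2npq)."""
    q = 1 - p
    return normal_ordinate(n, p, x) * exp(x * (p - q) / (2 * n * p * q))


if __name__ == "__main__":
    for n, p, x in [(100, 0.5, 5), (100, 0.2, 4), (100, 0.2, -4)]:
        print(n, p, x,
              binomial_ordinate(n, p, x),
              normal_ordinate(n, p, x),
              skew_normal_ordinate(n, p, x))
```

When $p=q$ the last two columns coincide, since the skew factor is then unity.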


It will be noted that (i), involving $x$ as well as $x^{2}$, represents
a slightly unsymmetrical (skew) series of ordinates with $p\neq q$,
and that when $p=q$ it reduces to the symmetrical form (10)
involving only $x^{2}$ and not $x$. When written, like (11), as a
continuous curve

$$f(x)=\frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^{2}}{c^{2}}}\,e^{\frac{x(p-q)}{2npq}} \tag{ii}$$

it is sometimes called appropriately (P:j^7:181) the Skew-Normal
Curve. As will be seen from the illustrations on p. 266; C; 4,
however, the effect of the skew term $e^{\frac{x(p-q)}{2npq}}$ is very slight, and
may generally be neglected except when $q$ (or $p$) is so small and
$n$ sufficiently large that $nq$ (or $np$) remains finite but small. Even
in those cases it will be preferable to use the Poisson exponential
(55), which is developed later in Chapter VII, rather than this
Skew-Normal form. The rationale of the expression, nevertheless,
is of interest in connection with the development of
generalized frequency curves, as set out in Chapter VII.


It may also be noted here (for later use in Chapter VII) that
the retention of certain terms neglected in the derivation of (i)
leads to

$$y_{x}\approx\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{x^{2}}{2npq}+\frac{x(p-q)}{2npq}+\frac{x^{3}(q-p)}{6(npq)^{2}}},$$

of which a proof is given easily in H :87 :33. This expression
evidently may also be written approximately as

$$y_{x}\approx\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{x^{2}}{2npq}}\left[1-\frac{x(q-p)}{2npq}+\frac{x^{3}(q-p)}{6(npq)^{2}}\right],$$

which is $\frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^{2}}{c^{2}}}\left[1-2j\left(\frac{x}{c}-\frac{2x^{3}}{3c^{3}}\right)\right]$ where $c=\sqrt{2npq}$ and
$j=\frac{\mu_{3}}{c^{3}}=\frac{q-p}{2\sqrt{2npq}}$, since $\mu_{3}=npq(q-p)$ as stated at p. 203; B; 4.

[This last form, which is given later in the text as Edgeworth's
formula (59) of Chapter VII, may be encountered by the student
in Bowley's discussions of Edgeworth's work. It should therefore
be remarked that Bowley's analyses in H:f5;8:33, 47, etc., and in
07:329 et seq., use $j=\frac{\mu_{3}}{c^{3}}=\frac{q-p}{2\sqrt{2npq}}$ as above, basing the
formulae on the point binomial $(q+p)^{n}$ in (3) et seq.; in H:57:
33-36, however, his analysis is based on $(p+q)^{n}$, so that
$j=\frac{\mu_{3}}{c^{3}}=\frac{p-q}{2\sqrt{2npq}}$ in that case, as pointed out at p. 203; B; 4 here.]

Retaining still further terms, the result is Edgeworth's (60)
in Chapter VII, where $i=\frac{\mu_{4}}{c^{4}}-\frac34$, which is $i=\frac{1-6pq}{4npq}$ for
the point binomial $(q+p)^{n}$ stated in H:162:47 (cf. p. 203;
B; 4).


An extended algebraic analysis of the form taken by the
expressions in the point binomial case has been given recently
by Shannon in P:i^^:380-1. The Skew-Normal approximation is
there deduced for the point binomial $(p+q)^{n}$, so that (i), which
is based on $(q+p)^{n}$, becomes
$\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{h^{2}}{2npq}}\,e^{\frac{h(q-p)}{2npq}}$. The
statement of this function, moreover, is given separately for
negative and positive values of the deviation $h$, so that the
numerator of the exponent of the skew term appears as $h(q-p)$
on the positive side for a positive deviation but as $h(p-q)$ on
the negative side for a negative deviation $h$.


B; 6. Finite Integrations of the Normal Law of Deviations 

The dependence of the Normal Law of Deviations, (10), upon 
integral values of np+x imposes the restriction that any func- 
tions involving summation derivable therefrom shall be deter- 
mined by finite integration. 

The probability, for example, that in the $n$ trials the number
of actual occurrences of the event will lie between $np-k$ and
$np+k$ will be the summation $\sum_{x=-k}^{x=+k}y_{x}$. For the Normal Curve of
Error, however, $x$ (and $c$) may take any values; the corresponding
probability deducible from (11) is therefore the integral $\int_{-k}^{+k}y_{x}\,dx$.
The algebraic processes of finite integration, accordingly, should
be employed for the former, and the infinitesimal calculus for the
latter; and the demonstrations (to be given later) of certain
functions derivable from these two expressions should either recognize
the distinction by the use of parallel proofs based respectively on
finite and infinitesimal summation, or should employ a sufficiently
accurate approximation between $\sum_{x=-k}^{x=+k}y_{x}$ and $\int_{-k}^{+k}y_{x}\,dx$.

The establishment of parallel proofs, of course, is quite simple
when the summations or integrations are performed over the
entire range. Under those circumstances the resulting formulae
are identical, as will be seen from the derivation of the mean
number of successes, $np$, for the point binomial in (4), and for the
Normal Law of Deviations (10), or for the Normal Curve of
Error (11) when $c=\sqrt{2npq}$, and similarly for the mean square
deviation, $npq$. The demonstration of the average deviation
(irrespective of sign), which is given in this study for the Normal
Curve as $\frac{c}{\sqrt{\pi}}$, may likewise easily be shown to be $\sqrt{\frac{2npq}{\pi}}$ for
the integral case of the point binomial; see, for example, P:51:
110 (from J.I.A. XXVII, 214).

When the entire range is not covered, an approximation
between $\sum_{x=-k}^{x=+k}y_{x}$ and $\int_{-k}^{+k}y_{x}\,dx$ is required. It can of course be
effected to any desired degree of accuracy by using the well-known
formulae for approximate summation, as illustrated in P:57:122
and P:l^:375. (The problem has also been examined elaborately
in H:S9:249, and rules for computation are given in P:^^:377.)
Remembering, however, that, by neglecting the differential
coefficients which arise in the formulae of approximate summation,
the ordinate $y_{n}$ may be regarded practically as representing
$\int_{n-\frac12}^{n+\frac12}y_{x}\,dx$, it will generally be sufficient to extend the limits
of integration by $\frac12$ at each end and so take the integral as
$\int_{-k-\frac12}^{k+\frac12}y_{x}\,dx=2\int_{0}^{k+\frac12}y_{x}\,dx$ (since the expressions (10) and (11) are
symmetrical), or to leave the limits of integration unchanged
but to add $\frac12(y_{k}+y_{-k})$, which is $y_{k}$, and thus employ
$\sum_{x=-k}^{x=+k}y_{x}=2\int_{0}^{k}y_{x}\,dx+y_{k}$. (See p. 268; C; 5 here for the methods and
examples of numerical computation, and also P:i7-^:48, H:5P:249, and
P:f^^:35.) When the whole range is covered by taking $k=\infty$,
$y_{-k}$ and $y_{k}$, of course, vanish, so that $\int_{-\infty}^{+\infty}y_{x}\,dx$ may be substituted
for $\sum_{x=-\infty}^{x=+\infty}y_{x}$, and then, as already noted, the expressions
appropriate to the integral case are those resulting from the
integrations of (11) when $c=\sqrt{2npq}$.

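In calculating such values (see p. 160; A; 5) the integral
$\int_{0}^{k}y_{x}\,dx$, where $y_{x}=\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{x^{2}}{2npq}}$, is taken, by changing the
variable to $t=\frac{x}{c}$, as $\frac{1}{\sqrt{\pi}}\int_{0}^{\frac{k}{c}}e^{-t^{2}}\,dt$ where $c=\sqrt{2npq}$. Similarly,

$$\int_{0}^{k+\frac12}y_{x}\,dx=\frac{1}{\sqrt{\pi}}\int_{0}^{\frac1c\left(k+\frac12\right)}e^{-t^{2}}\,dt.$$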

B; 7. The Integrations of the Normal Curve 

In the following demonstrations based on the use of the
integral calculus, the probability of a "deviation" or "error"
between $x$ and $x+\delta x$, where $\delta x$ is infinitesimally small, will, of
course, be taken as $y_{x}\,\delta x$ if $y_{x}$ is the probability of a deviation or
error $x$. While this will be entirely clear to those students who
are thoroughly familiar with the concepts of the calculus, it may
nevertheless be advisable to observe that the principle follows
at once from the fact that $y_{x}$ represents a probability curve, so
that the probability of an error between given limits is equal to
the area under the probability curve between those limits. When
the limits are infinitesimally close together, this area is represented
simply by a rectangle of which the height is $y_{x}$ and the base the
infinitesimal $\delta x$.

Alternatively, the derivation of (12) may be viewed simply
as the calculation of the average value of the deviation or error $x$,
from the continuous curve (11), and hence (omitting $\frac{1}{c\sqrt{\pi}}$ from
both numerator and denominator) being

$$\frac{\int_{-\infty}^{+\infty}x\,e^{-\frac{x^{2}}{c^{2}}}\,dx}{\int_{-\infty}^{+\infty}e^{-\frac{x^{2}}{c^{2}}}\,dx}=\frac{0}{c\sqrt{\pi}},\ \text{by (c) and (a) here,}\ =0.$$


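The integrals in this expression, and in (12), (13), (14), and
(16) are found as follows:

(a) In $\int_{-\infty}^{+\infty}e^{-\frac{x^{2}}{c^{2}}}\,dx$, put $\frac{x}{c}=t$ and hence $dx=c\,dt$. The
integral therefore $=c\int_{-\infty}^{+\infty}e^{-t^{2}}\,dt=2c\int_{0}^{\infty}e^{-t^{2}}\,dt=c\sqrt{\pi}$ by (b) here.

(b) $\int_{0}^{\infty}e^{-t^{2}}\,dt=\frac{\sqrt{\pi}}{2}$, which may be shown thus: Call
$\int_{0}^{\infty}e^{-t^{2}}\,dt=u$. Then, substituting $at$ for $t$,
$\int_{0}^{\infty}e^{-a^{2}t^{2}}a\,dt=u$, whence (multiplying each side by $e^{-a^{2}}$),
$\int_{0}^{\infty}e^{-a^{2}(1+t^{2})}a\,dt=u\,e^{-a^{2}}$. Integrating both sides now with
regard to $a$,

$$\int_{0}^{\infty}\!\!\int_{0}^{\infty}e^{-a^{2}(1+t^{2})}a\,dt\,da=u\int_{0}^{\infty}e^{-a^{2}}\,da=u^{2}.$$

But $\int_{0}^{\infty}e^{-a^{2}(1+t^{2})}a\,da=\left[-\frac{e^{-a^{2}(1+t^{2})}}{2(1+t^{2})}\right]_{0}^{\infty}=\frac{1}{2(1+t^{2})}$. Hence

$$u^{2}=\frac12\int_{0}^{\infty}\frac{dt}{1+t^{2}}=\frac12\Big[\tan^{-1}t\Big]_{0}^{\infty}=\frac12\left(\frac{\pi}{2}-0\right)=\frac{\pi}{4},\ \text{so that}\ u=\frac{\sqrt{\pi}}{2}.$$

(c) $\int x\,e^{-\frac{x^{2}}{c^{2}}}\,dx$ is obtainable by direct integration, and is
$-\frac{c^{2}}{2}\,e^{-\frac{x^{2}}{c^{2}}}$.

(d) In (13) we deal with twice the integral from $0$ to $\infty$ when
sign is disregarded, instead of the integral from $-\infty$ to $+\infty$ when
sign is taken into account as in (12).

(e) To evaluate $\int_{0}^{\infty}x^{2}e^{-\frac{x^{2}}{c^{2}}}\,dx$, integrate by parts with $x\,e^{-\frac{x^{2}}{c^{2}}}\,dx$
as one part, and thus obtain
$\left[-\frac{x\,c^{2}e^{-\frac{x^{2}}{c^{2}}}}{2}\right]_{0}^{\infty}+\frac{c^{2}}{2}\int_{0}^{\infty}e^{-\frac{x^{2}}{c^{2}}}\,dx$. Since
the bracket vanishes, the remaining integral becomes $\frac{c^{3}}{2}\int_{0}^{\infty}e^{-t^{2}}\,dt$
by putting $\frac{x}{c}=t$, and this, by (b) here, is $\frac{c^{3}\sqrt{\pi}}{4}$.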
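The results (b) and (e) can be confirmed by quadrature. The following minimal Python sketch is not part of the original text: the use of Simpson's rule, the finite upper limits (beyond which the integrands are taken to be negligible), and the value chosen for $c$ are all assumptions made for illustration.

```python
from math import exp, pi, sqrt


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


c = 1.7  # arbitrary illustrative value of the modulus c

# (b): integral of exp(-t^2) from 0 to infinity, truncated at t = 12.
gauss = simpson(lambda t: exp(-t * t), 0.0, 12.0)

# (e): integral of x^2 exp(-x^2 / c^2) from 0 to infinity, truncated at x = 12c.
second = simpson(lambda x: x * x * exp(-((x / c) ** 2)), 0.0, 12.0 * c)

if __name__ == "__main__":
    print(gauss, sqrt(pi) / 2)
    print(second, c**3 * sqrt(pi) / 4)
```

Each printed pair shows the quadrature value followed by the closed form it is meant to reproduce.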


B ; 8. The Integration in the Case of Two Independent Observed 
Quantities 

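The evaluation of

$$\frac{1}{c\sqrt{\pi}}\int_{-\infty}^{+\infty}e^{-\frac{x^{2}}{c^{2}}}\,dx\;\frac{1}{k\sqrt{\pi}}\int_{z-x}^{z+\delta z-x}e^{-\frac{y^{2}}{k^{2}}}\,dy \tag{24}$$

is effected as follows (cf., for instance, H:^^:28-33; V:80:12S; or
P:f^;?:292-5).

When $\delta a$ is an infinitesimally small increment we have, by
the geometrical principles of the integral calculus (the "mean
ordinate rule" for integration), that
$\int_{a}^{a+\delta a}\phi(x)\,dx=\frac12[\phi(a+\delta a)+\phi(a)]\,\delta a$, since the area depicted by
the integral is a rectangle with base $\delta a$ and mean height
$\frac12[\phi(a+\delta a)+\phi(a)]$. In (24) above,
$\int_{z-x}^{z+\delta z-x}e^{-\frac{y^{2}}{k^{2}}}\,dy$ may accordingly be written

$$\frac12\left[e^{-\frac{(z+\delta z-x)^{2}}{k^{2}}}+e^{-\frac{(z-x)^{2}}{k^{2}}}\right]\delta z=e^{-\frac{(z-x)^{2}}{k^{2}}}\,\delta z,$$

since $\delta z$ is infinitesimally small.

In now replacing the second integral of (24) by this value it
is to be remembered that $x$ may take any value from $-\infty$ to $+\infty$;
the expression $e^{-\frac{(z-x)^{2}}{k^{2}}}\,\delta z$ must therefore be brought under the
first integral, so that (24) becomes

$$\frac{1}{ck\pi}\int_{-\infty}^{+\infty}e^{-\frac{x^{2}}{c^{2}}}\,e^{-\frac{(z-x)^{2}}{k^{2}}}\,\delta z\,dx=\frac{\delta z}{ck\pi}\int_{-\infty}^{+\infty}e^{-\left[\frac{x^{2}}{c^{2}}+\frac{(z-x)^{2}}{k^{2}}\right]}\,dx \tag{24a}$$

Now

$$\frac{x^{2}}{c^{2}}+\frac{(z-x)^{2}}{k^{2}}=\frac{k^{2}+c^{2}}{c^{2}k^{2}}\left[x^{2}-\frac{2xzc^{2}}{k^{2}+c^{2}}\right]+\frac{z^{2}}{k^{2}}.$$

Adding $\left(\frac{zc^{2}}{k^{2}+c^{2}}\right)^{2}$ inside the bracket to make a perfect square,
and subtracting correspondingly outside, this becomes

$$\frac{k^{2}+c^{2}}{c^{2}k^{2}}\left(x-\frac{zc^{2}}{k^{2}+c^{2}}\right)^{2}+\frac{z^{2}}{k^{2}}-\frac{z^{2}c^{2}}{k^{2}(k^{2}+c^{2})}=\frac{k^{2}+c^{2}}{c^{2}k^{2}}\left(x-\frac{zc^{2}}{k^{2}+c^{2}}\right)^{2}+\frac{z^{2}}{k^{2}+c^{2}}.$$

Putting $\frac{\sqrt{k^{2}+c^{2}}}{ck}\left(x-\frac{zc^{2}}{k^{2}+c^{2}}\right)=t$, so that $dx=\frac{ck}{\sqrt{k^{2}+c^{2}}}\,dt$,
and remembering that $z$ is to be regarded as constant, (24a) may
therefore be written

$$\frac{\delta z}{ck\pi}\,e^{-\frac{z^{2}}{k^{2}+c^{2}}}\int_{-\infty}^{+\infty}\frac{ck}{\sqrt{k^{2}+c^{2}}}\,e^{-t^{2}}\,dt=\frac{\delta z}{\pi\sqrt{k^{2}+c^{2}}}\,e^{-\frac{z^{2}}{k^{2}+c^{2}}}\left(\sqrt{\pi}\right)\ \text{by (b) on p. 209; B; 7}$$

$$=\frac{1}{\sqrt{\pi}\sqrt{k^{2}+c^{2}}}\,e^{-\frac{z^{2}}{k^{2}+c^{2}}}\,\delta z.$$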
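The result just reached, namely that the sum of two independent errors following Normal Curves of moduli $c$ and $k$ follows a Normal Curve of modulus $\sqrt{c^{2}+k^{2}}$, may be checked numerically. The Python sketch below is not part of the original text; the function names, the midpoint rule, and the finite range of integration are assumptions made for illustration only.

```python
from math import exp, pi, sqrt


def normal_curve(x, modulus):
    """Curve of the form (11): exp(-x^2 / modulus^2) / (modulus * sqrt(pi))."""
    return exp(-((x / modulus) ** 2)) / (modulus * sqrt(pi))


def density_of_sum(z, c, k, half_width=40.0, steps=8000):
    """Midpoint-rule value of the integral in (24a), with the factor dz omitted."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        # Error x in the first quantity combined with error z - x in the second.
        total += normal_curve(x, c) * normal_curve(z - x, k)
    return total * h


if __name__ == "__main__":
    c, k, z = 1.5, 2.0, 0.8
    print(density_of_sum(z, c, k))
    print(normal_curve(z, sqrt(c * c + k * k)))
```

The two printed values should agree closely for any $z$ well inside the range of integration.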


A useful geometrical interpretation of the preceding proof
has been given by Crofton (H:42), and is reproduced (with slight
changes) in P:Jf5:25.

The derivation of (24) may be explained in an even more
elementary manner by setting out the various combinations of
errors which might occur when $x+y=z$, i.e., $x-nh,\ldots,x-h,$
$x,x+h,\ldots,x+nh$ in $F_{1}$, and $y+nh,\ldots,y+h,y,y-h,\ldots,y-nh$
in $F_{2}$, then assigning to each its probability, and combining, as
is shown in H:52:29.

In many texts the student will meet an explanation of (26)
which may be extended as follows: Any error $x$ in $F_{1}$, and $y$ in
$F_{2}$, producing an error $x+y$, say $X$, in $F_{1}+F_{2}$, will give a square
of error $X^{2}=x^{2}+y^{2}+2xy$. This being true for all values of $x$
and $y$, the mean values of the two sides of the last equation will
be equal; and since it is known that the means of the squares of
the errors in $F_{1}$ and $F_{2}$ are $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$, it follows that the mean
square error, say $\Sigma^{2}$, in $F_{1}+F_{2}$ will be given by $\Sigma^{2}=\sigma_{1}^{2}+\sigma_{2}^{2}+$
2 (mean value of the product $xy$). But the errors $x$ in $F_{1}$ follow a
Normal Curve, in which particular positive and negative values
of $x$ are equally likely, while similarly the errors $y$ in $F_{2}$ obey
another Normal Curve with particular positive and negative
values of $y$ again equally likely. Whatever parameters those
two normal laws have, therefore (i.e., whether the normal curves
show wide or narrow distributions of their respective error
systems), there is an equal probability that a particular value of $x$
will be associated with a negative or with a positive value of $y$,
and vice versa. Since positive and negative values thus cancel
each other throughout, the mean value of the product of errors,
$xy$, must be zero. Consequently, $\Sigma^{2}=\sigma_{1}^{2}+\sigma_{2}^{2}$.


B; 9. The "Dispersion" (Standard Deviation) under Bernoulli,
Poisson, and Lexis Sampling

The derivation of the formulae will be seen most easily by
following three urn schemata (cf. P:5^:117-122, P:;^7:323-8, and
P:7Je:146-155).

(a) For the Bernoulli sampling, suppose that we have $\nu$ urns,
containing white and black balls, knowing that in each urn the
true probability of drawing a white ball is $p$. We draw a set of
$n$ balls (one at a time, the balls being replaced each time) from
the first urn, $U_{1}$. Suppose we get $a_{1}$ white balls. Then from the
second urn, $U_{2}$, we similarly draw $n$ balls, remembering that
the probability of a white ball is again $p$; and suppose we get $a_{2}$
white balls. If we continue in exactly the same way with each
of $\nu$ urns, $p$ having the same value for every urn (so that
$p_{1}=p_{2}=\ldots=p_{\nu}=p$), the sequence $a_{1},a_{2},\ldots,a_{\nu}$ forms a
sampling of the Bernoulli type.

This is clearly the same as if we had $\nu$ different districts, each
populated by a single group of $n$ males of a certain age, for whom
the probability of survival (represented by the proportion of
white balls) is under examination, and in each of which the "true"
probability of survival is a constant $p$.

(b) For the Poisson case, suppose that there are $n$ urns with
white and black balls, the proportions of white balls being different
in each urn, or $p_{1},p_{2},\ldots,p_{n}$ respectively. By drawing one
ball from each urn we get a set of $n$ balls altogether; suppose that
$a_{1}$ of them are white. Having replaced them, the process is
repeated, giving a second set, of which $a_{2}$ are white. Making
$\nu$ sets of drawings thus, the sequence $a_{1},a_{2},\ldots,a_{\nu}$ forms a
sampling of the Poisson type.

This scheme can be visualized as $\nu$ districts, each populated
by $n$ separate males (of $n$ different ages) with different true
probabilities of survival $p_{1},p_{2},\ldots,p_{n}$. The average chances are the
same for each universe from which a sample is drawn, but they
vary from group to group within the universe.

(c) For the Lexis case, suppose that there are $\nu$ urns, as in the
Bernoulli scheme, with white and black balls, the proportions
of white balls again being different in each urn, or $p_{1},p_{2},\ldots,p_{\nu}$
respectively, as for the Poisson case. But balls are now drawn
as for the Bernoulli scheme. That is, we draw a set of $n$ balls
(one at a time, the balls being replaced) from the first urn, $U_{1}$,
getting, say, $a_{1}$ white balls; then another set of $n$ balls from $U_{2}$,
giving $a_{2}$ white balls; and so on up to the final set of $n$ from $U_{\nu}$,
getting $a_{\nu}$ white balls. Then the resulting $a_{1},a_{2},\ldots,a_{\nu}$ form a
sampling of the Lexis type.

Here we can imagine $\nu$ different districts, each populated by
a single group of $n$ males of a certain age, but each group having
a different true probability of survival, $p_{1},p_{2},\ldots,p_{\nu}$ respectively.
The probabilities vary from universe to universe from
which the samples are drawn.

Now if we were presented with a sequence $a_{1},a_{2},\ldots,a_{\nu}$ of
white balls, drawn by any one of these methods, there would be
no a priori reason for choosing any particular $a$ rather than any
other; consequently the arithmetic mean might be taken,
namely, $\frac{a_{1}+a_{2}+\ldots+a_{\nu}}{\nu}$. What, then, would be the expected
value of this arithmetic mean, and what would be the average
of the mean square deviations over all the sets, i.e., $\sigma^{2}$, under
these three urn schemes?

In order to determine these values, and at the same time
show the three cases in parallel forms, we can use the same
principles as those adopted for the Bernoulli case in reaching formulae
(4) and (7).

(a) To illustrate Bernoulli sampling again here in the simple
urn-drawing scheme, we find

In Set No. 1, the expected value of $a_{1}$ is $np$
In Set No. 2, the expected value of $a_{2}$ is $np$
. . . . . . . . . .
In Set No. $\nu$, the expected value of $a_{\nu}$ is $np$.

Consequently the expected value of

$$\frac{a_{1}+a_{2}+\ldots+a_{\nu}}{\nu}=\frac{\nu(np)}{\nu}=np.$$

Similarly for the mean square deviation (remembering that,
as shown in (7), the expected value of the squares of the
deviations measured from the mean in the Bernoulli case is $npq$) we
have

In Set No. 1, the expected value of $(a_{1}-np)^{2}$ is $npq$
In Set No. 2, the expected value of $(a_{2}-np)^{2}$ is $npq$
. . . . . . . . . .
In Set No. $\nu$, the expected value of $(a_{\nu}-np)^{2}$ is $npq$.

Summing, therefore, and averaging by dividing by $\nu$, we find

$$\sigma_{B}^{2}=\frac{\nu(npq)}{\nu}=npq.$$


(b) For the Poisson case the same method gives

In Set No. 1, the expected value of $a_{1}$ is $(p_{1}+p_{2}+\ldots+p_{n})$
In Set No. 2, the expected value of $a_{2}$ is $(p_{1}+p_{2}+\ldots+p_{n})$
. . . . . . . . . .
In Set No. $\nu$, the expected value of $a_{\nu}$ is $(p_{1}+p_{2}+\ldots+p_{n})$.

The expected value of $\frac{a_{1}+a_{2}+\ldots+a_{\nu}}{\nu}$ therefore $=np$, where
$p=\frac{p_{1}+p_{2}+\ldots+p_{n}}{n}$.

In examining the mean square deviation here it must be
noted that one ball is first drawn from $U_{1}$ with the knowledge
that the probability of getting a white ball is $p_{1}$; this is the same
as $n$ trials with probabilities $p_{1}$ and $q_{1}$ when $n=1$; the mean square
deviation for this single trial is therefore $np_{1}q_{1}$ where $n=1$, or
$p_{1}q_{1}$; and similarly for each subsequent ball drawn to make the
first set of $n$. Consequently, since the trials are all independent,
$\sigma^{2}$ for their sum is to be obtained, by (27), from the sum of the
various values of $\sigma^{2}$, so that in Set No. 1, the mean square
deviation is $p_{1}q_{1}+p_{2}q_{2}+\ldots+p_{n}q_{n}=\sum_{t=1}^{t=n}p_{t}q_{t}$, and in every other set
up to Set No. $\nu$ we get the same result. Summing, therefore,
and dividing by $\nu$, as in the Bernoulli case, we find

$$\sigma_{P}^{2}=\sum_{t=1}^{t=n}p_{t}q_{t}=\sum_{t=1}^{t=n}[p+(p_{t}-p)][q-(p_{t}-p)]$$

$$=\sum_{t=1}^{t=n}\left[pq-(p_{t}-p)(p-q)-(p_{t}-p)^{2}\right]$$

$$=npq-\sum_{t=1}^{t=n}(p_{t}-p)^{2},\ \text{since}\ \sum_{t=1}^{t=n}(p_{t}-p)=0,$$

$$=\sigma_{B}^{2}-n\sigma_{p}^{2}\ \text{where}\ \sigma_{p}^{2}\ \text{is}\ \frac1n\sum_{t=1}^{t=n}(p_{t}-p)^{2},$$

being the mean square deviation of the probabilities $p_{1},p_{2},\ldots,p_{n}$
from their mean $p$.


(c) In the Lexis case it is evident that

In Set No. 1, the expected value of $a_{1}$ is $np_{1}$
In Set No. 2, the expected value of $a_{2}$ is $np_{2}$
. . . . . . . . . .
In Set No. $\nu$, the expected value of $a_{\nu}$ is $np_{\nu}$,

whence the expected value of

$$\frac{a_{1}+a_{2}+\ldots+a_{\nu}}{\nu}\ \text{is}\ \frac{n(p_{1}+p_{2}+\ldots+p_{\nu})}{\nu}=np$$

where $p=\frac{p_{1}+p_{2}+\ldots+p_{\nu}}{\nu}$.

Consider now the mean square deviation. In drawing $n$ balls
from $U_{1}$, when the probability of a white ball is $p_{1}$, the mean
square deviation of white balls from $np_{1}$ is $np_{1}q_{1}$; similarly for $U_{2}$,
when the probability of a white ball is $p_{2}$, the mean square
deviation of white balls from $np_{2}$ is $np_{2}q_{2}$, and likewise for each
drawing, until finally the mean square deviation of white balls from
$np_{\nu}$ in drawing $n$ balls from $U_{\nu}$ is $np_{\nu}q_{\nu}$. These several mean
square deviations are measured from the means, $np_{1},np_{2},\ldots,$
$np_{\nu}$, respectively. But they are here, in conformity with
the Bernoulli and Poisson cases, to be measured from $np$
(where $p=\frac{p_{1}+p_{2}+\ldots+p_{\nu}}{\nu}$ as above); we must therefore (see
p. 255; B; 27) add to each of the preceding the respective values
$(np_{1}-np)^{2},(np_{2}-np)^{2},\ldots,(np_{\nu}-np)^{2}$. We thus have $np_{t}q_{t}+$
$(np_{t}-np)^{2}$ for the mean square deviation of white balls from $np$
in drawing $n$ balls from $U_{t}$ $(t=1,2,3,\ldots,\nu)$. Summing and
dividing by $\nu$ as in the derivation of the corresponding expressions
for the Bernoulli and Poisson series we find that for the Lexis series

$$\sigma_{L}^{2}=\frac{n}{\nu}\sum_{t=1}^{t=\nu}p_{t}q_{t}+\frac{n^{2}}{\nu}\sum_{t=1}^{t=\nu}(p_{t}-p)^{2}.$$

It was shown incidentally, however, in the Poisson case that
$\sum_{t=1}^{t=\nu}p_{t}q_{t}$ can be written as $\nu pq-\sum_{t=1}^{t=\nu}(p_{t}-p)^{2}$; consequently

$$\sigma_{L}^{2}=npq+\frac{n^{2}-n}{\nu}\sum_{t=1}^{t=\nu}(p_{t}-p)^{2}=\sigma_{B}^{2}+(n^{2}-n)\sigma_{p}^{2}$$

where $\sigma_{p}^{2}$ is $\frac{1}{\nu}\sum_{t=1}^{t=\nu}(p_{t}-p)^{2}$, being the mean square
deviation of the probabilities $p_{1},p_{2},\ldots,p_{\nu}$ from their mean $p$.
Another type of proof employing "generating functions" is
given conveniently in P:3:16-25 and 49-53. A further discussion
of the assumptions underlying these Poisson and Lexis cases may
also be found in P:146:208-234.
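The three dispersions may be compared on a small example. The sketch below is in Python and is not part of the original text; the function names and the chosen probabilities are illustrative assumptions, and exact fractions are used so that the relations between the Poisson or Lexis dispersion and the Bernoulli value $npq$ can be verified without rounding.

```python
from fractions import Fraction


def bernoulli_variance(n, p):
    """Mean square deviation npq for n trials at constant probability p."""
    return n * p * (1 - p)


def poisson_variance(ps):
    """Poisson scheme: one ball from each of the n urns, giving the sum of p_t q_t."""
    return sum(p * (1 - p) for p in ps)


def lexis_variance(n, ps):
    """Lexis scheme: average over the urns of n p_t q_t + (n p_t - n p)^2."""
    p_bar = sum(ps) / len(ps)
    return sum(n * p * (1 - p) + (n * p - n * p_bar) ** 2 for p in ps) / len(ps)


def spread_of_probabilities(ps):
    """Mean square deviation of the probabilities from their mean."""
    p_bar = sum(ps) / len(ps)
    return sum((p - p_bar) ** 2 for p in ps) / len(ps)


if __name__ == "__main__":
    ps = [Fraction(1, 10), Fraction(3, 10), Fraction(1, 2)]  # illustrative values
    p_bar = sum(ps) / len(ps)
    # Poisson: n is the number of urns; the dispersion falls below npq.
    print(poisson_variance(ps), bernoulli_variance(len(ps), p_bar))
    # Lexis: n balls from each urn; the dispersion rises above npq.
    print(lexis_variance(50, ps), bernoulli_variance(50, p_bar))
```

In the Poisson scheme the number of trials in a set equals the number of urns, whereas in the Lexis scheme $n$ is the number of balls drawn from each urn, which is why the two functions take different arguments.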


The Charlier Coefficient of Disturbancy is founded on the last 
formula. Dividing by p^ we have 
where A is written for the mean np. Charlier's Coefficient C is

taken as . 

A 

The Relation between the Lexis and $\chi^{2}$ Methods

The equivalence of $L^{2}$ and $\frac{\chi^{2}}{\nu}$, in testing the Bernoullian
hypothesis of constant probability, as remarked by Irwin (P:^5:
507), "has perhaps been insufficiently appreciated". Even though
the student will not encounter the $\chi^{2}$ method until later in this
text, it may therefore be well to establish the relationship here,
in order to emphasize R. A. Fisher's statement (P:>^5:83) that
"in many references in English ... it has not ... been noted that
the discovery of the distribution of $\chi^{2}$ in reality completed the
method of Lexis".

The demonstration, following Irwin's discussion in V.BS:
507, may be based on the mathematical model used in (ii) on
p. 245; B; 24. If we are given, for example, an observed series
$f_{r}$ for values of $r$ from 1 to $\nu$, which may be regarded as samples
from a population of $n$ in which the true probability of occurrence
is $p$ (and of failure, $q$), so that the true mean is $np$, the Lexis
function $L^{2}$ is found by (41) as $\frac{\frac{1}{\nu}\sum_{r=1}^{r=\nu}(f_{r}-np)^{2}}{npq}$. In the $\chi^{2}$
method, on the other hand (as explained further in Chapter VI, and
in (ii) on p. 245; B; 24 and (4) on p. 337; C; 25), the hypothesis
of constant probability is tested by computing

$$\chi^{2}=\sum_{r=1}^{r=\nu}\left[\frac{(f_{r}-np)^{2}}{np}+\frac{(f_{r}-np)^{2}}{n-np}\right]$$

with $\nu$ "degrees of freedom." This expression, since $p+q=1$,
reduces to $\frac{\sum_{r=1}^{r=\nu}(f_{r}-np)^{2}}{npq}$; and this is $\nu L^{2}$, so that $L^{2}$ and $\frac{\chi^{2}}{\nu}$ are
equivalent.
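The equivalence may be illustrated on a small series. The Python sketch below is not part of the original text; the function names and the observed values are hypothetical, and the second term of $\chi^{2}$ is written in terms of the failures $n-f_{r}$ with expectation $nq$, which gives the same squared deviation $(f_{r}-np)^{2}$.

```python
from fractions import Fraction


def lexis_ratio_squared(fs, n, p):
    """L^2: mean of (f_r - np)^2 over the series, divided by npq."""
    q = 1 - p
    return sum((f - n * p) ** 2 for f in fs) / (len(fs) * n * p * q)


def chi_square(fs, n, p):
    """Chi-square with occurrences and failures as the two classes of each sample."""
    q = 1 - p
    return sum(
        (f - n * p) ** 2 / (n * p) + ((n - f) - n * q) ** 2 / (n * q) for f in fs
    )


if __name__ == "__main__":
    fs = [12, 9, 15, 10]          # hypothetical observed occurrences
    n, p = 40, Fraction(1, 4)     # hypothetical sample size and true probability
    print(lexis_ratio_squared(fs, n, p))
    print(chi_square(fs, n, p) / len(fs))
```

The two printed fractions are identical, being $L^{2}$ and $\chi^{2}/\nu$ for the same series.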
B; 10. Tchebycheff's Inequality and its Extensions


In 1853 Bienaymé (H:S9) suggested certain fundamental
ideas in connection with the Law of Large Numbers, which in
1867 were established independently by Tchebycheff (H:57), and
since that time have undergone development at the hands of
Tschuprow, Markoff, Bernstein, Khintchine, Guldberg, Meidell,
Pearson, Camp, and others.

The Theorem, or Inequality, of Tchebycheff, or the Bienaymé-
Tchebycheff Criterion (as the basic inequality is variously named)
states that the probability is $\leq\frac{1}{a^{2}}$ that a value of $x$, taken at
random from $n$ values $x_{1},x_{2},\ldots,x_{n}$, will differ from their mean
by as much as or more than $a\sigma$ (where $a$ is $>1$, and $\sigma$ is the standard
deviation). For example, the probability is not greater than
$\frac{1}{16}$ that a variate taken at random from that "population" will
deviate from the arithmetic mean by at least 4 times the standard
deviation. Similarly, it follows that the probability is not less
than $\frac{15}{16}$ that the deviation will be less than 4 times the standard
deviation. For, writing $m$ for the mean, we have $m=\frac1n\sum_{r=1}^{r=n}x_{r}$
and

$$\sigma^{2}=\frac1n\sum_{r=1}^{r=n}(x_{r}-m)^{2} \tag{a}$$

and by Tchebycheff's inequality not more than $\frac{n}{a^{2}}$ of the $x$'s can
deviate from $m$ by more than $a\sigma$. In order to establish this,
suppose that $\frac{n}{a^{2}}$ of them do so deviate by more than $a\sigma$ each; the
sum of the squares of their deviations would then be greater than
$\frac{n}{a^{2}}(a\sigma)^{2}$, that is, greater than $n\sigma^{2}$, which is inconsistent with (a),
and is therefore impossible.
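The inequality can be tried on any finite set of values. The following minimal Python sketch is not part of the original text; the function name and the sample of values are illustrative assumptions. It counts the proportion of values deviating from their mean by $a\sigma$ or more, for comparison with the limit $\frac{1}{a^{2}}$.

```python
def fraction_beyond(values, a):
    """Proportion of the values whose deviation from the mean is at least a*sigma."""
    n = len(values)
    m = sum(values) / n
    # Standard deviation about the mean with divisor n, as in (a).
    sigma = (sum((x - m) ** 2 for x in values) / n) ** 0.5
    return sum(1 for x in values if abs(x - m) >= a * sigma) / n


if __name__ == "__main__":
    values = [0] * 18 + [10, -10]  # a deliberately long-tailed illustrative set
    for a in (2, 3, 4):
        print(a, fraction_beyond(values, a), 1 / a**2)
```

No assumption is made about the form of the distribution, which is the point of the theorem; the limit is accordingly often far from being attained.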

The derivation of Bernoulli's Theorem from this inequality
proceeds by investigating the probability (see p. 187; B; 2) that the
deviation of the observed proportion of occurrences in $n$ trials
from $p$ will be numerically greater than $a\sqrt{\frac{pq}{n}}$, since, by (22),
$\sigma=\sqrt{\frac{pq}{n}}$. By Tchebycheff's criterion, this
probability does not exceed $\frac{1}{a^{2}}$. Writing $a\sqrt{\frac{pq}{n}}=\epsilon$, it follows
that the probability in question does not exceed $\frac{pq}{\epsilon^{2}n}$. Here $pq$
can never exceed $\frac14$, and for any assigned $\epsilon$ the expression
approaches zero as the number of trials, $n$, is increased indefinitely;
that is to say, the probability of obtaining a deviation greater
than $\epsilon$ tends ultimately to 0, and consequently the probability
of a deviation less than any assigned positive quantity $\epsilon$
approaches 1 as the limit.
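The manner in which the limit $\frac{pq}{\epsilon^{2}n}$ closes upon the actual probability as $n$ increases may be seen numerically. The Python sketch below is not part of the original text; the function names and the values of $p$ and $\epsilon$ are illustrative assumptions. The actual probability is obtained by summing the terms of the point binomial for which the observed proportion differs from $p$ by $\epsilon$ or more.

```python
from math import comb


def tail_probability(n, p, eps):
    """Probability, from the point binomial, that |s/n - p| is at least eps."""
    q = 1 - p
    return sum(
        comb(n, s) * p**s * q ** (n - s)
        for s in range(n + 1)
        if abs(s / n - p) >= eps
    )


def tchebycheff_bound(n, p, eps):
    """Upper limit pq / (eps^2 n) given by the inequality."""
    return p * (1 - p) / (eps * eps * n)


if __name__ == "__main__":
    p, eps = 0.3, 0.05
    for n in (50, 200, 800):
        print(n, tail_probability(n, p, eps), tchebycheff_bound(n, p, eps))
```

For small $n$ the limit may exceed unity and so convey nothing; it is only its tendency to zero with increasing $n$ that the theorem requires.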

Tchebycheff's inequality places an upper limit on the
probability that a variate will deviate from the mean by at least a
given multiple of the standard deviation, and it will be observed
that no restriction is placed upon the nature of the distribution
of the population values $x_{1},x_{2},\ldots,x_{n}$. This important feature
led Karl Pearson (H:128) to the extension that the probability
is $\leq\frac{\mu_{2r}}{a^{2r}\sigma^{2r}}$ that a value of $x$ represented by a continuous function
will deviate from the mean by at least $a\sigma$ (which reduces to
Tchebycheff's inequality when $r=1$), and Camp has evolved
(H:136) a still further generalization which includes both of the
preceding cases (see P:n^:143, and also P:75:353 and P:146:182
for the contributions of the Russian mathematicians).


B; 11. "Presumptive" Values


Formula (42) provides an estimate or "presumptive" value,
$\bar{\sigma}^{2}$, of the $\sigma^{2}$ in the universe, from the $\sigma_{s}^{2}$ of the sample, by the
relation $\bar{\sigma}^{2}=\frac{n}{n-1}\sigma_{s}^{2}$. In addition to the proof given on p. 35,
and the alternative demonstration (p. 38) based on the Principle
of Insufficient Reason, the student may be referred to P:52:35
(or 434-5 in the J.I.A. reprint), to P:140:11-20, and to P:^5:345-
362 for further analyses.

Since the standard deviation is defined, as shown in formulae
(6) to (8), as the square root of the second moment about the
mean, the expression for $\bar{\sigma}^{2}$ may be put in terms of moments as
$\bar{m}_{2}=\frac{n}{n-1}m_{2}$, where $m_{2}$ is
as derived from the sample and $\bar{m}_{2}$ is the estimated value for the
universe. (Steffensen's notation, $\bar{m}_{2}$ and $m_{2}$, is used here for
simplicity in view of the references to his discussion given above).
The corresponding relations for the estimates of the mean ($\bar{m}_{1}$)
and for the third and fourth moments about that mean ($\bar{m}_{3}$ and
$\bar{m}_{4}$), give the series of formulae

$$\left.\begin{aligned}
\bar{m}_{1}&=m_{1}\\
\bar{m}_{2}&=\frac{n}{n-1}\,m_{2}\\
\bar{m}_{3}&=\frac{n^{2}}{(n-1)(n-2)}\,m_{3}\\
\bar{m}_{4}&=\frac{n^{3}}{(n-1)(n^{2}-3n+3)}\left[m_{4}-\frac{3(2n-3)}{n(n-1)}\,m_{2}^{2}\right]
\end{aligned}\right\} \tag{42a}$$


Algebraic proofs may be found in V:52: loc. cit. and P:<95:350-2. 
[It should be noted, as pointed out by Tschuprow (Biometrika, 
XII, 187) and Lidstone (J.I.A., LXI, 346), that in V.52 the for- 
mula for the fourth moment is incorrect, being too small by 

^ M2 in the notation there employed, due to the omission 
n^ 

of 2 before Xxlxl in line 15 of T.A.S.A., VIII, 35, and line 9 of 
J.I.A., XLI, 435.] 

These expressions appear (see P:Jf .4^:14) to have been given
first, in terms of half-invariants, by Thiele (H:94:48), who
derived the series of formulae as far as the 8th half-invariant.

The criticisms which, as pointed out on p. 36, may be levelled
at these "presumptive" values led Tschuprow to another set of
formulae which agree with those of Thiele stated above for $\bar{m}_{1}$,
$\bar{m}_{2}$, and $\bar{m}_{3}$, but for $\bar{m}_{4}$ give

$$\frac{n}{(n-1)(n-2)(n-3)}\left[(n^{2}-2n+3)\,m_{4}-3(2n-3)\,m_{2}^{2}\right].$$

Reference may be made to P:140:18 for further details.

The values of the $m$'s in formulae (42a) are in reality based
(as noted on pp. 35-36) on mean values derived from a large
number of samples. The practical problem of which a solution
is required, however, is the estimation of the values in the
universe from the data furnished by a single sample. This important
distinction has been emphasized by R. A. Fisher, who has derived
(P:40, and P:4S:75) a series of $k$-statistics for this purpose, where
the $k$'s are estimates of the population (i.e., universe)
half-invariants, $m'$ denotes the moment about the mean derived from
a single sample, and $M'_{1}$ is that mean:

$$k_{1}=M'_{1},$$

$$k_{2}=\frac{n}{n-1}\,m'_{2},$$

$$k_{3}=\frac{n^{2}}{(n-1)(n-2)}\,m'_{3},$$

$$k_{4}=\frac{n^{2}}{(n-1)(n-2)(n-3)}\left[(n+1)\,m'_{4}-3(n-1)(m'_{2})^{2}\right].$$

The actuarial student may be referred conveniently to P:t60 for
additional comments.
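The computation of these statistics from a single sample may be sketched as follows. The Python below is not part of the original text; the function name is an illustrative assumption, and the formulae coded are the usual expressions for the first four $k$-statistics in terms of the sample moments about the mean, with the stated requirement that the sample contain at least four values.

```python
def k_statistics(sample):
    """Return (k1, k2, k3, k4) from the moments of one sample about its mean."""
    n = len(sample)
    if n < 4:
        raise ValueError("at least four observations are needed for k4")
    mean = sum(sample) / n
    # Sample moments about the mean, each with divisor n.
    m2, m3, m4 = (sum((x - mean) ** r for x in sample) / n for r in (2, 3, 4))
    k2 = n / (n - 1) * m2
    k3 = n**2 / ((n - 1) * (n - 2)) * m3
    k4 = n**2 / ((n - 1) * (n - 2) * (n - 3)) * ((n + 1) * m4 - 3 * (n - 1) * m2**2)
    return mean, k2, k3, k4


if __name__ == "__main__":
    print(k_statistics([1, 2, 3, 4]))
```

It will be observed that $k_{2}$ is simply the sum of squared deviations divided by $n-1$.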


B; 12. The “Probability of Causes”; Bayes’ Theorem, and 
Laplace’s Generalization 

In the mathematical model stated on p. 37, $\kappa_{1}$ (for example)
is the a priori probability of the existence of $F_{1}$; and $\pi_{1}$ is the
a priori probability that, when $F_{1}$ exists, the event $E$ will happen.
The product $\kappa_{1}\pi_{1}$ is therefore the probability that the event $E$
originated from $F_{1}$. Now we are here concerned with $n$ trials,
and $s$ successes; if the event $E$ originates from $F_{1}$, the probability
of its happening exactly $s$ times out of $n$ is ${}^{n}C_{s}\,\pi_{1}^{s}(1-\pi_{1})^{n-s}$; and
the probability that $F_{1}$ exists, and that then $E$ happens $s$ times
in $n$ trials, is consequently $\kappa_{1}\,{}^{n}C_{s}\,\pi_{1}^{s}(1-\pi_{1})^{n-s}$. There must
clearly be a similar expression for each one of the conditions
$F_{1},F_{2},\ldots,F_{n}$; and since they are mutually exclusive, the total
probability that one of the conditions $F_{1},F_{2},\ldots,F_{n}$ exists, and
that then $E$ will happen $s$ times in $n$, is

$$\sum_{r=1}^{r=n}\kappa_{r}\,{}^{n}C_{s}\,\pi_{r}^{s}(1-\pi_{r})^{n-s} \tag{43}$$

It accordingly follows, when we know, by observation, that $E$
has actually happened $s$ times out of $n$, that the probability,
a posteriori, that the particular condition $F_{a}$ was the origin is

$$\frac{\kappa_{a}\,\pi_{a}^{s}(1-\pi_{a})^{n-s}}{\sum_{r=1}^{r=n}\kappa_{r}\,\pi_{r}^{s}(1-\pi_{r})^{n-s}} \tag{43a}$$

since the constant factor ${}^{n}C_{s}$ cancels from the numerator and
denominator.

This is the general form of the method which is usually
referred to loosely as "Bayes' Rule". It is extremely important
to realize that the a priori existence probabilities, $\kappa_{r}$, as well as the
a priori productive probabilities, $\pi_{r}$, must be known if the formula
is to be applicable in practice. Failure to appreciate this fact
has led on many occasions to misapplication of the rule, and
consequently to paradoxical results.

The chief source of these paradoxes has been the tacit
assumption (the "Principle of Insufficient Reason") that all the values
of $\kappa_{r}$ can be supposed to be equal when their real values are
unknown. If that assumption were sound, (43a) clearly would
reduce to

$$\frac{\pi_{a}^{s}(1-\pi_{a})^{n-s}}{\sum_{r=1}^{r=n}\pi_{r}^{s}(1-\pi_{r})^{n-s}} \tag{43b}$$

This simplified expression, however, obviously must be used with
very great care. For Ellis' remark must always be remembered,
that "mere ignorance is no ground for any inference whatever;
ex nihilo nihil" (H:24).

The general formula (43a) represents a perfectly sound and
logical argument, and leads to unexceptionable results when it
can be applied rigorously. The famous theorem of Bayes-
Laplace, in other words, cannot be challenged successfully when
it is properly employed; the doubts which have been thrown upon
it have arisen from the improper assumption that (43b) should
be adopted whenever specific values cannot be attached to the
a priori existence probabilities, $\kappa_{r}$.

Bayes' original treatment of the problem was restricted to
the case where the values of $\kappa_{r}$ are all equal. The generalization
when the $\kappa_{r}$'s are not all equal was given first by Laplace (see
p. 165; A; 9).
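The effect of the existence probabilities upon the result may be seen from a small numerical case. The Python sketch below is not part of the original text; the function name and all the probabilities used are hypothetical. It evaluates (43a) for stated existence probabilities, and then again with those probabilities put equal, which is the special case (43b).

```python
def posterior(existence, productive, n, s):
    """Formula (43a): a posteriori probabilities of the several conditions.

    existence  -- a priori existence probabilities of the conditions
    productive -- a priori productive probabilities of the conditions
    The binomial coefficient is omitted because it cancels.
    """
    weights = [
        k * pi_r**s * (1 - pi_r) ** (n - s)
        for k, pi_r in zip(existence, productive)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    productive = [0.5, 0.8]  # hypothetical productive probabilities
    n, s = 10, 8             # the event observed 8 times in 10 trials
    print(posterior([0.9, 0.1], productive, n, s))  # (43a), unequal existence
    print(posterior([0.5, 0.5], productive, n, s))  # (43b), equal existence
```

With the figures assumed, the first condition remains the more probable origin under the unequal existence probabilities but becomes much the less probable when they are put equal, which illustrates how far the conclusion may depend upon the assumption criticized above.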


B; 13. The Proofs of the Distributions of (1) jc — m; (2) 

(3) “Student’s 2” = ^ — (4)—; and (5) “Fish- 

( ^ \ (Ts 2(^e 

* 

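Suppose that we have a parent population, $N$ in number,
distributed normally with mean $m$ and standard deviation $\sigma$
according to the Normal Curve $y=\frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-m)^{2}}{2\sigma^{2}}}$, which is
another form of (11). If a random sample numbering $n$, with
magnitudes $x'_{1},x'_{2},\ldots,x'_{n}$, is then drawn, the probability that
the members of the sample will lie between $x'_{1}$ and $x'_{1}+dx'_{1}$, $x'_{2}$ and
$x'_{2}+dx'_{2}$, $\ldots$, $x'_{n}$ and $x'_{n}+dx'_{n}$, is clearly

$$\frac{1}{\sigma^{n}(2\pi)^{\frac{n}{2}}}\,e^{-\frac{1}{2\sigma^{2}}\sum(x'_{r}-m)^{2}}\,dx'_{1}\,dx'_{2}\ldots dx'_{n},$$

which may be written (as already noted on p. 38)

$$A\,e^{-\frac{1}{2\sigma^{2}}\left[\sum(x'_{r}-\bar{x})^{2}+n(\bar{x}-m)^{2}\right]}\,dx'_{1}\,dx'_{2}\ldots dx'_{n} \tag{a}$$

where $A$ is a constant and $\bar{x}$ the sample mean.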

Now consider the members of the sample as a point, P', in a
space of n dimensions with rectangular coordinates $x'_1, x'_2, \ldots, x'_n$.
Then, just as with two coordinates, x and y, we find that the line
y = x is inclined equally to the two axes, so in space of n dimensions
(cf. p. 193; B; 3) the line $x_1 = x_2 = \ldots = x_n$ is inclined equally
to the n axes. The perpendicular from P' upon this line may
similarly be seen to be given by $(P'M)^2 = (x'_1-\bar{x})^2 + (x'_2-\bar{x})^2 + \ldots + (x'_n-\bar{x})^2$,
where M is the point $(\bar{x}, \bar{x}, \ldots, \bar{x})$. But this expression
is $n\sigma_s^2$; hence $P'M = \sigma_s\sqrt{n}$. The point P', also, must lie upon the
plane $x'_1 + x'_2 + \ldots + x'_n = n\bar{x}$ in order to satisfy the definition of
its mean; and as this plane is normal to the line $x_1 = x_2 = \ldots = x_n$
at M, it follows that P' must lie on a sphere of n − 1 dimensions
with radius $\sigma_s\sqrt{n}$ and centre at M.

In this space an element of volume, dv, may be expressed in
terms of the variation of $\bar{x}$, that is, $d\bar{x}$, and the variation in
surface area, since $d\bar{x}$ represents a perpendicular distance increment
above the plane, and $k\,d(\sigma_s^{n-1})$ represents an increment to
area enclosed by the plane within the sphere. [This may be seen
from the 3-dimensional case where the plane is $x'_1+x'_2+x'_3 = 3\bar{x}$
and the sphere is $(x'_1-\bar{x})^2+(x'_2-\bar{x})^2+(x'_3-\bar{x})^2 = 3\sigma_s^2$; then $d\bar{x}$ is the
perpendicular distance increment above the plane, and $d(\sigma_s^2)$ is
proportional to the increment of area of the circle enclosed on
the plane by the sphere, since the area $= 3\pi\sigma_s^2$ and d(area) $= 3\pi\, d(\sigma_s^2)$.]
Hence dv is proportional to $d(\sigma_s^{n-1})\,d\bar{x}$, that is, to $\sigma_s^{n-2}\,d\sigma_s\,d\bar{x}$.

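By this device of representing the sample by a point in multiple
space we therefore now see that, from the probability (a), we
can write

$$C\,\sigma_s^{n-2}\, e^{-\frac{1}{2\sigma^2}\left[\sum (x'_r-\bar{x})^2 + n(\bar{x}-m)^2\right]}\, d\sigma_s\, d\bar{x} \qquad \ldots(b)$$

where C is some constant, for the probability that the sample
will have a mean lying between $\bar{x}$ and $\bar{x}+d\bar{x}$ and a standard
deviation lying between $\sigma_s$ and $\sigma_s+d\sigma_s$. Since $\sum (x'_r-\bar{x})^2 = n\sigma_s^2$,
this can at once be put as

$$C\left[e^{-\frac{n(\bar{x}-m)^2}{2\sigma^2}}\, d\bar{x}\right]\left[\sigma_s^{n-2}\, e^{-\frac{n\sigma_s^2}{2\sigma^2}}\, d\sigma_s\right] \qquad \ldots(c)$$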

(1) The Distribution of the Deviation $\bar{x}-m$.

From the fact that $\bar{x}$ and $\sigma_s$ are entirely separated in this
expression it follows that the distribution of the deviation $\bar{x}-m$ of
the mean of a sample from the mean of the universe, for any and all
values of $\sigma_s$, is given by the normal form

$$C_m\, e^{-\frac{n(\bar{x}-m)^2}{2\sigma^2}}\, d\bar{x} \qquad \ldots(d)$$

where $C_m$ is a constant.
$C_m$ is obtained by making the area of the distribution curve
unity, or $1 = C_m \int_{-\infty}^{+\infty} e^{-\frac{nx^2}{2\sigma^2}}\, dx$ where $x = \bar{x}-m$, which gives (cf.
p. 209; B; 7) $C_m = \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}$.


(2) The Distribution of the Standard Deviation $\sigma_s$.

Similarly the distribution of the standard deviation of a
sample, for any and all values of $\bar{x}$, is given by the second bracket
of (c), namely,

$$C_s\, \sigma_s^{n-2}\, e^{-\frac{n\sigma_s^2}{2\sigma^2}}\, d\sigma_s \qquad \ldots(e)$$

where $C_s$ is a constant.


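Here $C_s$ may be found as follows (see p. 259; B; 29 for the
$\Gamma$ function):

Let $\frac{n\sigma_s^2}{2\sigma^2} = x$, so that $\frac{n\sigma_s}{\sigma^2} = \frac{dx}{d\sigma_s}$.

Then

$$1 = C_s \int_0^\infty \sigma_s^{n-2}\, e^{-\frac{n\sigma_s^2}{2\sigma^2}}\, d\sigma_s
= C_s \int_0^\infty \frac{(2x)^{\frac{n-2}{2}} (\sigma^2)^{\frac{n-2}{2}}}{n^{\frac{n-2}{2}}}\, e^{-x}\, \frac{\sigma\, dx}{(2nx)^{\frac{1}{2}}}$$

$$= C_s\, \frac{\sigma^{n-1}}{n^{\frac{n-1}{2}}} \int_0^\infty (2x)^{\frac{n-3}{2}}\, e^{-x}\, dx
= C_s\, \frac{\sigma^{n-1}\, 2^{\frac{n-3}{2}}}{n^{\frac{n-1}{2}}}\, \Gamma\!\left(\frac{n-1}{2}\right)$$

whence

$$C_s = \frac{n^{\frac{n-1}{2}}}{2^{\frac{n-3}{2}}\, \sigma^{n-1}\, \Gamma\!\left(\frac{n-1}{2}\right)}$$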

This distribution of $\sigma_s$ was given first by Helmert in 1876 (see
p. 165; A; 10). It may be of assistance to note that, since any $\sigma_s$
is necessarily positive, the distribution (e) is, in fact (as was found
also by "Student" in his method of approach; see p. 165; A; 10)
a skew bell-shaped Pearson Type III curve (see p. 71) limited
at 0 but extending theoretically to $\infty$.
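As a numerical check on the constants of (d) and (e), the following Python sketch integrates both distributions with Simpson's rule and confirms that each area is close to unity. It assumes the constants in the forms $C_m = \sqrt{n}/(\sigma\sqrt{2\pi})$ and $C_s = n^{(n-1)/2} / \left(2^{(n-3)/2}\sigma^{n-1}\Gamma\left(\frac{n-1}{2}\right)\right)$; the sample size, the value of $\sigma$, and the function names are illustrative assumptions.

```python
import math

def c_m(n, sigma):
    """Constant of (d): sqrt(n) / (sigma * sqrt(2*pi))."""
    return math.sqrt(n) / (sigma * math.sqrt(2 * math.pi))

def c_s(n, sigma):
    """Constant of (e), in the form assumed in the lead-in above."""
    return n ** ((n - 1) / 2) / (
        2 ** ((n - 3) / 2) * sigma ** (n - 1) * math.gamma((n - 1) / 2)
    )

def simpson(f, a, b, steps):
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

n, sigma = 10, 2.0   # hypothetical sample size and parent standard deviation

# Area under (d), with d standing for the deviation x-bar minus m.
area_mean = simpson(
    lambda d: c_m(n, sigma) * math.exp(-n * d * d / (2 * sigma ** 2)),
    -10.0, 10.0, 4000,
)
# Area under (e), with s standing for the sample standard deviation.
area_sd = simpson(
    lambda s: c_s(n, sigma) * s ** (n - 2) * math.exp(-n * s * s / (2 * sigma ** 2)),
    0.0, 20.0, 4000,
)
print(area_mean, area_sd)
```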


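(3) The Distribution of "Student's z" $= \frac{\bar{x}-m}{\sigma_s}$.

In order now to find the distribution of "Student's" ratio,
$\frac{\bar{x}-m}{\sigma_s} = z$, we may keep in mind the diagram of Figure 20.

Expression (c) is the probability of a sample mean and standard
deviation falling in the intervals $\bar{x} \pm \frac{1}{2}d\bar{x}$ and $\sigma_s \pm \frac{1}{2}d\sigma_s$. This same
expression is also the probability that the "Student" ratio, z, and
the standard deviation of the sample, $\sigma_s$, will fall in the intervals
$z \pm \frac{1}{2}dz$ and $\sigma_s \pm \frac{1}{2}d\sigma_s$. To rewrite expression (c) in terms of
$\sigma_s$, $d\sigma_s$, z, and dz, we replace $\bar{x}-m$ by $z\sigma_s$, and $d\bar{x}$ by $\sigma_s\,dz$, and thus
obtain

$$\left[C_s\, \sigma_s^{n-2}\, e^{-\frac{n\sigma_s^2}{2\sigma^2}}\, d\sigma_s\right]\left[C_m\, e^{-\frac{nz^2\sigma_s^2}{2\sigma^2}}\, \sigma_s\, dz\right]
= (C_s C_m)\, e^{-\frac{n\sigma_s^2}{2\sigma^2}(1+z^2)}\, \sigma_s^{n-1}\, d\sigma_s\, dz \qquad \ldots(f)$$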


This is the simultaneous distribution of $\sigma_s$ and z. It is the probability
of a sample falling within the elementary shaded area of
Figure 20. All sample points within the wedge have the same
value of z, to within the amount dz; hence to find the probability
that "Student's" ratio falls within the interval $z \pm \frac{1}{2}dz$ for all
values of the sample standard deviation $\sigma_s$, we must sum the
probabilities expressed by (f) throughout the entire wedge, i.e., we
must integrate (f), allowing $\sigma_s$ to vary from 0 to $\infty$ while z
and dz remain constant during the integration. Putting
$\frac{n\sigma_s^2}{2\sigma^2}(1+z^2) = x$ it follows that

$$\int_0^\infty e^{-\frac{n\sigma_s^2}{2\sigma^2}(1+z^2)}\, \sigma_s^{n-1}\, d\sigma_s
= \frac{1}{2}\left[\frac{2\sigma^2}{n(1+z^2)}\right]^{\frac{n}{2}} \int_0^\infty e^{-x}\, x^{\frac{n-2}{2}}\, dx$$

This latter integral (see p. 259; B; 29) is $\Gamma\!\left(\frac{n}{2}\right)$. Inserting this
value, therefore, and also those of the constants $C_s$ and $C_m$, we
obtain the distribution of "Student's" z as

$$\frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)}\, (1+z^2)^{-\frac{n}{2}}\, dz$$



Another form of proof based on “characteristic functions” 
(first used by Lagrange and later systematically by Laplace — see 
P:^>^^:240 and 264, and P:;^^:23) may be found in P:7^^:336-9 
and P:;^;^:47-8. 
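The distribution of "Student's" z can be checked numerically. The sketch below assumes the density in the form $\frac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma((n-1)/2)}(1+z^2)^{-n/2}$, which is what results from inserting $C_s$, $C_m$ and $\Gamma(n/2)$ into the integral of (f); the sample size and the integration range are illustrative assumptions.

```python
import math

def student_z_density(z, n):
    """Density of z = (sample mean - m) / (sample s.d.), sample s.d. with divisor n."""
    const = math.gamma(n / 2) / (math.sqrt(math.pi) * math.gamma((n - 1) / 2))
    return const * (1 + z * z) ** (-n / 2)

def simpson(f, a, b, steps):
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

n = 6   # hypothetical sample size
area = simpson(lambda z: student_z_density(z, n), -60.0, 60.0, 24000)
print(area)
```

The area should come out very close to unity; the modern "t" is related to this ratio by $t = z\sqrt{n-1}$.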



Figure 20. Diagram showing the region within which "Student's"
ratio z remains constant for varying sample mean $\bar{x}$ and standard deviation $\sigma_s$.
All samples falling within the wedge have the same "Student's" ratio z to
within the amount dz. "Student's" ratio is defined as $z = \frac{\bar{x}-m}{\sigma_s}$. When $\bar{x}$
varies by the amount $d\bar{x}$ while $\sigma_s$ remains constant, z varies by the amount dz,
which is related to $d\bar{x}$ by the equation $d\bar{x} = \sigma_s\, dz$.
The meaning of "Student's" distribution may also be made
clearer by the use of "contours" (see p. 194; B; 3), as shown in
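(4) The Distribution of the Ratio $\frac{{}_1\sigma_e}{{}_2\sigma_e}$.

Suppose now that we have two estimates, ${}_1\sigma_e$ and ${}_2\sigma_e$, of the
same $\sigma$, where ${}_1\sigma_e$ is based on a first sample of $n_1$ variates $x'_1$ and
${}_2\sigma_e$ is based on a second sample of $n_2$ variates $x'_2$, so that, by (42),
${}_1\sigma_e^2 = \frac{\sum (x'_1-\bar{x}_1)^2}{n_1-1}$ and ${}_2\sigma_e^2 = \frac{\sum (x'_2-\bar{x}_2)^2}{n_2-1}$, and the respective
"degrees of freedom", say $d_1$ and $d_2$, are therefore $d_1 = n_1-1$ and
$d_2 = n_2-1$.

Considering the first sample, we see at once that if we use the
relation (42) between the sample variance, ${}_1\sigma_s^2$ say, and the estimate,
${}_1\sigma_e^2$, we have here approximately ${}_1\sigma_e^2 = \left(\frac{n_1}{n_1-1}\right){}_1\sigma_s^2$, whence
${}_1\sigma_s = \left(\frac{d_1}{n_1}\right)^{\frac{1}{2}} {}_1\sigma_e$ and $d({}_1\sigma_s) = \left(\frac{d_1}{n_1}\right)^{\frac{1}{2}} d({}_1\sigma_e)$. From
(e) in (2) of this section B; 13 we can therefore, using these
relations, write down immediately, in terms of the degrees of
freedom, $d_1$, and the estimate ${}_1\sigma_e$ (instead of in terms of $n_1$ and
${}_1\sigma_s$), that for the distribution of ${}_1\sigma_e$ the differential is given by

$$\frac{d_1^{\frac{d_1}{2}}}{2^{\frac{d_1-2}{2}}\, \sigma^{d_1}\, \Gamma\!\left(\frac{d_1}{2}\right)}\; {}_1\sigma_e^{\,d_1-1}\, e^{-\frac{d_1\, {}_1\sigma_e^2}{2\sigma^2}}\, d({}_1\sigma_e) \qquad \ldots(g)$$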



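As we are to investigate the distribution of $\frac{{}_1\sigma_e}{{}_2\sigma_e}$, let us denote
this ratio by w; then ${}_1\sigma_e = w\, {}_2\sigma_e$, and for a given value of ${}_2\sigma_e$ the
distribution of w becomes

$$\frac{d_1^{\frac{d_1}{2}}}{2^{\frac{d_1-2}{2}}\, \sigma^{d_1}\, \Gamma\!\left(\frac{d_1}{2}\right)}\; (w\, {}_2\sigma_e)^{d_1-1}\, e^{-\frac{d_1 w^2\, {}_2\sigma_e^2}{2\sigma^2}}\; {}_2\sigma_e\, dw \qquad \ldots(h)$$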

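The distribution of ${}_2\sigma_e$, however, is again in the form (g) with $d_2$
and ${}_2\sigma_e$ written for $d_1$ and ${}_1\sigma_e$. To obtain the distribution of w,
therefore, since ${}_1\sigma_e$ and ${}_2\sigma_e$ are independent, we merely have to
multiply that expression by (h) and integrate for values of ${}_2\sigma_e$
from 0 to $\infty$, and so find

$$\frac{d_1^{\frac{d_1}{2}}\, d_2^{\frac{d_2}{2}}\, w^{d_1-1}\, dw}{2^{\frac{d_1+d_2-4}{2}}\, \sigma^{d_1+d_2}\, \Gamma\!\left(\frac{d_1}{2}\right)\Gamma\!\left(\frac{d_2}{2}\right)} \int_0^\infty {}_2\sigma_e^{\,d_1+d_2-1}\, e^{-\frac{(d_1w^2+d_2)\, {}_2\sigma_e^2}{2\sigma^2}}\, d({}_2\sigma_e)$$

$$= \frac{2\, d_1^{\frac{d_1}{2}}\, d_2^{\frac{d_2}{2}}\; \Gamma\!\left(\frac{d_1+d_2}{2}\right)}{\Gamma\!\left(\frac{d_1}{2}\right)\Gamma\!\left(\frac{d_2}{2}\right)}\; \frac{w^{d_1-1}\, dw}{(d_1w^2+d_2)^{\frac{d_1+d_2}{2}}} \qquad \ldots(i)$$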

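(5) The Distribution of "Fisher's z" $= \log_e\left(\frac{{}_1\sigma_e}{{}_2\sigma_e}\right)$.

It follows immediately that if now, as R. A. Fisher does, we
write $z = \log_e w = \log_e\left(\frac{{}_1\sigma_e}{{}_2\sigma_e}\right)$, so that $dw = e^z\, dz$, the distribution of
"Fisher's z" can be written down from (i) as

$$\frac{2\, d_1^{\frac{d_1}{2}}\, d_2^{\frac{d_2}{2}}\; \Gamma\!\left(\frac{d_1+d_2}{2}\right)}{\Gamma\!\left(\frac{d_1}{2}\right)\Gamma\!\left(\frac{d_2}{2}\right)}\; \frac{e^{d_1 z}\, dz}{(d_1e^{2z}+d_2)^{\frac{d_1+d_2}{2}}}$$

from which (47a) in the text emerges at once since

$$B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right) = \frac{\Gamma\!\left(\frac{d_1}{2}\right)\Gamma\!\left(\frac{d_2}{2}\right)}{\Gamma\!\left(\frac{d_1+d_2}{2}\right)}$$

as defined at p. 259; B; 29.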
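Both the distribution of the ratio w and that of "Fisher's z" can be verified numerically. The sketch below assumes the density of w in the form $\frac{2\,d_1^{d_1/2} d_2^{d_2/2}}{B(d_1/2,\, d_2/2)} \cdot \frac{w^{d_1-1}}{(d_1w^2+d_2)^{(d_1+d_2)/2}}$ and obtains the density of z from it by the substitution $w = e^z$; the degrees of freedom and integration limits are illustrative assumptions.

```python
import math

def ratio_density(w, d1, d2):
    """Density of w = (first estimate of sigma) / (second estimate of sigma)."""
    beta = math.gamma(d1 / 2) * math.gamma(d2 / 2) / math.gamma((d1 + d2) / 2)
    const = 2 * d1 ** (d1 / 2) * d2 ** (d2 / 2) / beta
    return const * w ** (d1 - 1) / (d1 * w * w + d2) ** ((d1 + d2) / 2)

def fisher_z_density(z, d1, d2):
    """Density of z = log_e(w), obtained from ratio_density with dw = e^z dz."""
    return ratio_density(math.exp(z), d1, d2) * math.exp(z)

def simpson(f, a, b, steps):
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

d1, d2 = 5, 8   # hypothetical degrees of freedom
area_w = simpson(lambda w: ratio_density(w, d1, d2), 0.0, 50.0, 20000)
area_z = simpson(lambda z: fisher_z_density(z, d1, d2), -10.0, 10.0, 4000)
print(area_w, area_z)
```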
B; 14. Derivation of the Multinomial Normal Law of Deviations

Commencing with the multinomial general term

$$\frac{N!}{f'_1!\, f'_2! \ldots f'_\nu!}\; p_1^{f'_1}\, p_2^{f'_2} \ldots p_\nu^{f'_\nu}$$

and replacing the factorials by their approximations according
to Stirling's formula as in the deduction of the Normal Law (see
p. 203; B; 5), we obtain

$$\frac{1}{(2\pi N)^{\frac{\nu-1}{2}} \sqrt{p_1 p_2 \ldots p_\nu}}\; \prod_{r=1}^{\nu} \left(\frac{Np_r}{f'_r}\right)^{f'_r+\frac{1}{2}}$$

Writing $\frac{f'_r - Np_r}{\sqrt{Np_r}} = x_r$, each factor

$$\left(\frac{Np_r}{f'_r}\right)^{f'_r+\frac{1}{2}} = \left(1 + \frac{x_r}{\sqrt{Np_r}}\right)^{-\left(x_r\sqrt{Np_r} + Np_r + \frac{1}{2}\right)}$$

In order now to obtain an approximate expression for the product
of all these $\nu$ factors when N is large we take the logarithm of
the product, expand, collect terms as in the corresponding proof
at p. 204; B; 5, and find

$$\frac{1}{(2\pi N)^{\frac{\nu-1}{2}} \sqrt{p_1 p_2 \ldots p_\nu}}\; e^{-\frac{1}{2}\sum x_r^2}$$

subject to a similar condition as the Normal Law, namely, that
the approximation may not be satisfactory if $Np_r$ is less than
about 10 (see p. 267; C; 4, and p. 310; C; 14).

It will be noted that the conditions $f'_1+f'_2+\ldots+f'_\nu = N$ and
$Np_1+Np_2+\ldots+Np_\nu = N$ mean that $p_1+p_2+\ldots+p_\nu = 1$, since
all the N cases must be disposed of into the $\nu$ cells.
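The quality of this approximation can be gauged by comparing it with the exact multinomial term. The Python sketch below does so for a hypothetical three-cell case; the cell probabilities and observed frequencies are assumptions made purely for illustration, and the approximate expression is taken in the form $e^{-\frac{1}{2}\sum (f'_r - Np_r)^2/(Np_r)} \big/ \left[(2\pi N)^{(\nu-1)/2}\sqrt{p_1 \ldots p_\nu}\right]$.

```python
import math

def exact_multinomial(f, p):
    """Exact multinomial general term, computed through log-factorials."""
    n = sum(f)
    log_prob = math.lgamma(n + 1)
    for fr, pr in zip(f, p):
        log_prob += fr * math.log(pr) - math.lgamma(fr + 1)
    return math.exp(log_prob)

def normal_approximation(f, p):
    """Multinomial normal law of deviations in the form assumed in the lead-in."""
    n = sum(f)
    nu = len(p)
    chi_sq = sum((fr - n * pr) ** 2 / (n * pr) for fr, pr in zip(f, p))
    scale = (2 * math.pi * n) ** ((nu - 1) / 2) * math.sqrt(math.prod(p))
    return math.exp(-0.5 * chi_sq) / scale

# Hypothetical example: N = 300 cases in 3 cells, every N*p_r well above 10.
p = [0.2, 0.3, 0.5]
f = [62, 88, 150]

exact = exact_multinomial(f, p)
approx = normal_approximation(f, p)
print(exact, approx, approx / exact)
```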


B; 15. The Derivation and Characteristics of Poisson’s Law 
of Small Numbers 

We here seek to simplify the probability for np + x successes
and nq − x failures in n trials, i.e., the fundamental formula

$$\frac{n!}{(np+x)!\,(nq-x)!}\; p^{np+x}\, q^{nq-x} \qquad \ldots(2)$$

under the special conditions of q being so small and yet n sufficiently
large that nq = m, a small but finite number. [In this
demonstration q is supposed to be small rather than p, since in
those actuarial problems to which Poisson's formula is particularly
applicable it is $q_x$, the rate of mortality at age x, rather
than the complementary probability of survivorship, $p_x$, which
usually is small, so that it may assist the actuarial student to
think of q as being small.]

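Writing for simplicity nq − x = r, (2) becomes

$$\frac{n!}{r!\,(n-r)!}\; q^r p^{n-r} \qquad \ldots(a)$$

$$= \frac{n(n-1)\ldots(n-r+1)}{r!} \left(\frac{m}{n}\right)^r \left(1-\frac{m}{n}\right)^{n-r}
= \frac{m^r}{r!} \left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)\ldots\left(1-\frac{r-1}{n}\right)\left(1-\frac{m}{n}\right)^{n-r} \qquad \ldots(b)$$

Now

$$\left(1-\frac{m}{n}\right)^{n} \doteq e^{-m} \qquad \ldots(c)$$

and hence $\left(1-\frac{m}{n}\right)^{n-r} \doteq e^{-m}$
if n is large compared with mr, so that (b) becomes

$$\frac{m^r e^{-m}}{r!} \left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)\ldots\left(1-\frac{r-1}{n}\right)$$

Here, however, the product of the first r − 1 brackets is a decreasing
alternating series when expanded, namely, $1 - \frac{r(r-1)}{2n} + \ldots$,
and therefore lies between 1 and $1 - \frac{r(r-1)}{2n}$, and consequently
tends to 1 when 2n is large compared with r(r − 1). The expression
therefore is, approximately,

$$\frac{m^r e^{-m}}{r!}, \text{ the Poisson exponential} \qquad \ldots(55)$$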

Alternatively, the reduction may be effected by introducing
Stirling's formula (9) for the factorials n! and (n − r)! in (a), thus
obtaining

$$\frac{n^{n+\frac{1}{2}}\, e^{-n}}{r!\,(n-r)^{n-r+\frac{1}{2}}\, e^{-(n-r)}}\; q^r p^{n-r} \qquad \ldots(d)$$

But n being large, and r relatively small, $p^{n-r} = \left(1-\frac{m}{n}\right)^{n-r} \doteq e^{-m}$
by (c) above. Also, $\left(1-\frac{r}{n}\right)^{n-r+\frac{1}{2}} \doteq e^{-r}$, as may be seen
by expanding both expressions. Making these substitutions, (d)
therefore becomes $\frac{n^r q^r e^{-m}}{r!}$; and since nq = m
under the conditions assumed, we reach $\frac{m^r e^{-m}}{r!}$ as
before.


The mathematical analysis of the error involved in the approx- 
imation represented by the Poisson exponential is given in 
P;f4ff:135-137. 

The preceding proofs evolve Poisson's result by imposing
special conditions upon the point binomial. The formula, however,
may also be deduced by using Thiele's half-invariants (see
p. 257; B; 28) and requiring that the half-invariants of orders
higher than zero shall all be equal (see P:S^:265).


In the foregoing deductions m, which was written for the
expected number of happenings, has emerged as the single
parameter by which the formula is determined. It will be noted,
however, that, since (55) represents the probability of r occurrences
in n trials, it is a discrete function which exists only for
integral values r = 0, 1, 2, ..., n, and consequently that the mean,
standard deviation, etc., of the distribution should be determined
in practice from that range only. The formula, moreover, is not
a true probability distribution; if it were, the sum of the probabilities
for all possible occurrences would be unity, i.e.,
$\sum_{r=0}^{r=n} \frac{m^r e^{-m}}{r!}$
would be 1, whereas in fact this expression is unity only if the
upper limit of summation is extended from n to $\infty$.

It may, however, be shown as follows that the Mean $\doteq m$.
By the method adopted for the point binomial in the proof of (4)
in the main text, and so taking the origin at the beginning of the
range, the mean is

$$\sum_{r=0}^{r=n} r\, e^{-m}\, \frac{m^r}{r!} = m e^{-m}\left[1 + \frac{m}{1!} + \frac{m^2}{2!} + \ldots + \frac{m^{n-1}}{(n-1)!}\right]$$

Again here, if the limit of summation were extended from n to $\infty$,
this would be $m e^{-m} e^{m} = m$; when, however, n is merely large
enough that m is finite, this result is only approximately true.
Hence Mean $\doteq m$.

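Similarly, by the method of proof for (5), (6), and (7) in the
text hereof, the second moment about the origin is

$$\sum_{r=0}^{r=n} r^2\, e^{-m}\, \frac{m^r}{r!}$$

If, therefore, $\sigma^2$ (the mean square deviation about the mean) be
taken with reference to the approximate mean, m, we have

$$\sigma^2 = \sum_{r=0}^{r=n} (r-m)^2\, e^{-m}\, \frac{m^r}{r!}
= \sum r^2 e^{-m} \frac{m^r}{r!} - 2m \sum r\, e^{-m} \frac{m^r}{r!} + m^2 \sum e^{-m} \frac{m^r}{r!}$$

The first term here is

$$m e^{-m}\left[1 + \frac{m}{1!}(1+1) + \frac{m^2}{2!}(1+2) + \ldots + \frac{m^{n-1}}{(n-1)!}(1+n-1)\right]$$

$$= m e^{-m}\left[1 + \frac{m}{1!} + \frac{m^2}{2!} + \ldots + \frac{m^{n-1}}{(n-1)!}\right]
+ m^2 e^{-m}\left[1 + \frac{m}{1!} + \frac{m^2}{2!} + \ldots + \frac{m^{n-2}}{(n-2)!}\right]$$

Now when n is large, as is here supposed, each of these brackets
is not very different from $e^m$, so that this first term $\doteq m + m^2$.
Similarly, as in the proof for Mean $\doteq m$, we may write
$\sum r\, e^{-m} \frac{m^r}{r!} \doteq e^{-m}(m e^{m})$ and $\sum e^{-m} \frac{m^r}{r!} \doteq e^{-m}(e^{m})$.
The whole expression for $\sigma^2$ therefore gives

$$\sigma^2 \doteq \left[m + m^2 - 2m e^{-m}(m e^{m}) + m^2 e^{-m}(e^{m})\right] = m.$$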
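The statements that the sum of the probabilities falls short of unity for a finite upper limit n, and that the mean and the mean square deviation are each approximately m, can be checked directly. The following Python sketch accumulates the truncated sums; the value of m and the two upper limits are hypothetical choices made for illustration.

```python
import math

def truncated_poisson_moments(m, n):
    """Sum, mean and mean square deviation about m of m^r e^(-m)/r!, r = 0..n."""
    term = math.exp(-m)          # value of the formula at r = 0
    total = mean = second = 0.0
    for r in range(n + 1):
        total += term
        mean += r * term
        second += r * r * term
        term *= m / (r + 1)      # next term: m^(r+1) e^(-m) / (r+1)!
    # Mean square deviation taken about the approximate mean m, as in the text.
    variance = second - 2 * m * mean + m * m * total
    return total, mean, variance

m = 3.0
small_total, small_mean, small_var = truncated_poisson_moments(m, 5)     # short range
large_total, large_mean, large_var = truncated_poisson_moments(m, 1000)  # long range
print(small_total, small_mean, small_var)
print(large_total, large_mean, large_var)
```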

Tables of Poisson's Exponential Function 

A four-place table of $\frac{m^r e^{-m}}{r!}$ for values of m from .1 to 10
was published in H \75 by Bortkiewicz in 1898. A six-place table
for values of m from .1 to 15, and for r from 0 to 37, was computed
by H. E. Soper in 1914 (Biometrika, X, 25), and is reprinted in
P:97. In this latter volume of "Tables for Statisticians and
Biometricians" are also to be found tables prepared by Lucy
Whittaker to facilitate comparisons between the results of
Poisson's function and the "normal" theory.


B; 16 . Edgeworth’s Generalized Law of Error 


The main objective of Edgeworth's investigations of the
"law of error" was to discover, from general a priori conditions,
that true and unique law which could properly be held to represent
the frequency distribution of a magnitude depending on a
number of independently varying elements.

The problem was dealt with in his many papers by several 
different methods, of which Bowley gives an excellent account 
in H:f^;^:29-35, 39-47, and 134-5. The derivation is there shown 
on the assumption that there are “m elemental frequency groups, 
such that the chance of drawing a magnitude from, say, the qth 
group is where (p is an unknown function and aq the 

average of the group". Drawing then magnitudes $p\xi_1, p\xi_2, \ldots, p\xi_m$
from the m groups to form an aggregate $\sum_{q=1}^{q=m} (a_q + p\xi_q)$, or $A + pX$
where $pX = \sum (p\xi_q)$, and introducing certain postulates, he proceeded
to determine the moments of the frequency distribution
of the values of $pX$, and found them to be the same as the moments
of his Generalized Law of Error (58). The method may
be compared with that originally used by Laplace and Poisson,
from which a form equivalent to (58) can be obtained as shown
in P:i55:168-172.

A simple algebraic approach results also from taking into 
account the terms neglected in the derivation of the Normal 
Law of Deviations, as shown on p. 205; B; 5. 


B; 17. The Derivations of the Gram-Charlier and Poisson-Charlier
Series


The first determination of the constants of the Gram-Charlier
Type A series was effected by Gram (R:59) through the use of
a least squares criterion that

$$\int_{-\infty}^{+\infty} \frac{1}{\varphi(x)}\, (y' - y)^2\, dx$$

shall be a minimum, where y' denotes the observed values (see P:5^?:203-6,
and P:n^:169-170). The weighting of the squares of the deviations
with $\frac{1}{\varphi(x)}$ appears to have been adopted by Gram, without
comment, on account perhaps of its algebraic convenience (see
P:Jfi^:174).

Another derivation employed by Wicksell which 

leads to the same result, is to develop the point binomial (3) by 
Laplace’s method of generating functions (see P:n^:156-161). 


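The values of the coefficients can also be obtained from
the fact that the derivatives $\varphi_n(x)$ of the normal function
$\varphi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$, and the Hermite polynomials

$$H_n(x) = x^n - \frac{n(n-1)}{2}\, x^{n-2} + \frac{n(n-1)(n-2)(n-3)}{2 \cdot 4}\, x^{n-4} - \ldots$$

form a biorthogonal system satisfying the relations

$$\int_{-\infty}^{+\infty} \varphi_n(x)\, H_m(x)\, dx = 0 \text{ when } m \neq n, \text{ and } \int_{-\infty}^{+\infty} \varphi_n(x)\, H_n(x)\, dx = \frac{(-1)^n\, n!}{\sigma^n}$$

otherwise (see P:n^:165-8, and P:56r: 199-202).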

The determination of the Poisson-Charlier series (65) by
Wicksell's development of the point binomial is given conveniently
in P:n^:161-4.

In connection with these methods of representing frequency
distributions by means of series it may be well to emphasize that
their practical success obviously must depend upon rapid rather
than ultimate convergence. In actual application it is not
desirable to proceed beyond the first three or four terms, so that
it may be possible to describe a given frequency distribution by
a few constants only (cf. P:Jf4^:39 et seq., and P:f


B ; 18 . Transformation of the Variable 

Suppose that the frequencies for a variable x do not accord
with those of the Normal Curve (11), but that the corresponding
frequencies for some function of x, say f(x) = z, do so. Then the
relative frequency of the values between z and z+dz may be
written, from (11), as $\frac{1}{c\sqrt{\pi}}\, e^{-\frac{z^2}{c^2}}\, dz = F(z)\,dz$, say. Since z = f(x),
we have $\frac{dz}{dx} = f'(x)$, or $dz = f'(x)\,dx$, and the relative frequency of
the values between z and z+dz, or F(z)dz, becomes

$$\frac{1}{c\sqrt{\pi}}\, f'(x)\, e^{-\frac{[f(x)]^2}{c^2}}\, dx.$$

This "transformed" frequency function, being expressible as
$\varphi(x)\,dx$, now evidently represents a distribution of relative
frequencies for the variable x.

As an example, if $z = \sqrt{x}$, so that $f(x) = \sqrt{x}$ and $f'(x) = \frac{1}{2\sqrt{x}}$,
we can from the above write down immediately $\varphi(x)\,dx = \frac{1}{2c\sqrt{\pi x}}\, e^{-\frac{x}{c^2}}\, dx$;
the transformed frequency function is consequently
$\varphi(x) = \frac{1}{2c\sqrt{\pi x}}\, e^{-\frac{x}{c^2}}$.

Again, if $z = \log(x-a) = f(x)$, we have $f'(x) = \frac{1}{x-a}$
and $\varphi(x)\,dx = \frac{1}{c\sqrt{\pi}\,(x-a)}\, e^{-\frac{[\log(x-a)]^2}{c^2}}\, dx$, which is the
logarithmic frequency function (70) discussed in Chapter VII.
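The logarithmic example can be checked numerically: if the transformation is valid, the transformed frequency function must still enclose unit area. The Python sketch below assumes the normal curve in the form $\frac{1}{c\sqrt{\pi}} e^{-z^2/c^2}$ and the transformation $z = \log_e(x-a)$; the values of a and c and the integration range are hypothetical.

```python
import math

def log_transformed_density(x, a, c):
    """phi(x) = f'(x) * F(f(x)) with f(x) = log(x - a) and F the normal curve in c."""
    t = x - a
    return math.exp(-(math.log(t) ** 2) / (c * c)) / (c * math.sqrt(math.pi) * t)

def simpson(f, a, b, steps):
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

a, c = 1.0, 0.5   # hypothetical origin shift and scale constant
area = simpson(lambda x: log_transformed_density(x, a, c), a + 1e-9, a + 60.0, 60000)
print(area)
```

The lower limit is placed just above a because the function is defined only for x greater than a.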
B; 19. The Fourth Degree Exponential as the System of
Curves for which Moments up to $\mu_4$ Provide the "Best"
Method of Fitting

In Chapter VIII on the Fitting of Curves it is pointed out
that the equations for the determination of the constants in the
fitting process known as the Method of Least Squares (the theory
of which is founded essentially on the assumption of normality
in the errors of observation) will give, when those equations are
duly "weighted" (so that the standard deviation is uniform
throughout), results very similar to those of the "unweighted"
equations of the Method of Moments (see p. 243; B; 23) in the case
of the representation by an exponential form $f''_x = e^{a_0+a_1x+a_2x^2+\ldots}$
of a frequency distribution for which the weights may be taken
as approximately $\frac{1}{f''_x}$.

The same principle appears, in effect, in R. A. Fisher's conclusion
(P:57:356) that the use of moments up to $\mu_4$ (as in
Pearson's system) is, on a criterion of "efficiency" implying the
minimum variance, the "best" system of fitting when it is applied
to a fourth degree exponential (72). Observing that the first two
moments have "100% efficiency" for the Normal Curve, Fisher
points out (loc. cit.) that for symmetrical curves of the Pearson
type the method of moments "has an efficiency exceeding 80%
only in the restricted region for which $\beta_2$ lies between the limits
2.65 and 3.42 and for which $\beta_1$ does not exceed 0.1". In then
determining the system of curves for which moments up to $\mu_4$
constitute the "best" method of fitting, he remarks that "if the
frequency in the range dx be $y(x, \theta_1, \theta_2, \theta_3, \theta_4)\,dx$, then $\frac{\partial}{\partial\theta} \log y$ must
involve x only as polynomials up to the fourth degree";
consequently as in (72) "the convergence of
the probability integral requiring that the coefficient of $x^4$ should
be negative, and the five quantities $a, a_0, a_1, a_2, a_3$ being connected
by a single relation, representing the fact that the total
probability is unity" (ibid.).
B; 20. The Genesis of the Verhulst-Pearl-Reed (the "Logistic")
Curve of Population Growth


The conditions assumed in Verhulst's original deduction of the
logistic form (102) with A = 0 were that the proportionate rate of
increase over time, t, of a population, $P_t$, growing in a restricted
area, will tend to become less and less as the population becomes
greater.


If the population were growing over time in a geometrical
progression, so that $P_t = ar^t$, then the proportionate rate of
growth, $\frac{1}{P_t}\left(\frac{dP_t}{dt}\right)$, would be a constant $\log_e r = m$, say. In order,
however, that this rate of growth shall gradually decrease we
may write, as the simplest assumption, $\frac{1}{P_t}\left(\frac{dP_t}{dt}\right) = m - nP_t$,
where n is a constant. The solution of this differential equation
then gives immediately the logistic expression in the form (102a)
below (cf. P:/7^:4, 43, and 44). The fundamental principle of the
curve is thus the assumption that the instantaneous proportionate
rate of increase is a decreasing linear function of the population.
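The step from the differential equation to the logistic form can be checked numerically. The Python sketch below integrates $\frac{dP_t}{dt} = P_t(m - nP_t)$ by a fourth-order Runge-Kutta method and compares the result with the standard closed-form solution $P_t = \frac{m/n}{1 + Ce^{-mt}}$, where $C = \frac{m/n}{P_0} - 1$. The closed form is written here in notation assumed for the illustration (it need not match the constants of (102) exactly), and the parameter values are hypothetical.

```python
import math

def logistic_closed_form(t, m, n, p0):
    """Standard solution of dP/dt = P(m - nP) with P(0) = p0."""
    limit = m / n                 # limiting population as t grows
    c = limit / p0 - 1
    return limit / (1 + c * math.exp(-m * t))

def integrate_growth(t_end, m, n, p0, steps=1000):
    """Fourth-order Runge-Kutta integration of dP/dt = P(m - nP)."""
    def rate(p):
        return p * (m - n * p)
    h = t_end / steps
    p = p0
    for _ in range(steps):
        k1 = rate(p)
        k2 = rate(p + h * k1 / 2)
        k3 = rate(p + h * k2 / 2)
        k4 = rate(p + h * k3)
        p += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return p

m, n, p0 = 0.5, 0.005, 10.0     # hypothetical constants: limiting population m/n = 100
numeric = integrate_growth(10.0, m, n, p0)
closed = logistic_closed_form(10.0, m, n, p0)
print(numeric, closed)
```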

Formula (102) is a convenient form employed by Cramer 
(P:;^S:200). 

Putting A = 0 in (102), and multiplying numerator and
denominator by the appropriate constant, denoted C, we obtain $P_t = \frac{B}{1+Ce^{-mt}}$, whence
also

$$P_t = \frac{B}{1+Ce^{-mt}} = \frac{B'}{C'+e^{-mt}} \qquad \ldots(102a)$$

as alternative forms which are frequently used.

Another useful type, which is preferred by Yule (and is examined
fully in P:176:5 and 46 et seq.), is

$$P = \frac{L}{1+e^{\frac{\beta-t}{\alpha}}} \qquad \ldots(102b)$$

where L is the limiting population $P_\infty$, $\alpha$ determines the horizontal
scale, and $\beta$ is the time from zero to the point of inflexion.
The simplest expression (see P:176:5-6) results from choosing
the scales so that L and $\alpha$ are unity and the point of inflexion is
at zero time, so that

$$P = \frac{1}{1+e^{-t}} \qquad \ldots(102c)$$

Exhaustive discussions of the hypotheses underlying the 
logistic may be found particularly in P:Pff:567 et seq., 

and P:ff0. 

The mathematical relations between the summation of two 
logistic curves and the generalized form (103) are examined in 
P:f 05:742. 

The various methods of fitting which have been employed in 
practice are indicated at p. 321; C; 19 and p. 327; C; 21. 


B; 21. The Specification of the Parent Population from the 
Observed Values of a Sample 


The problem of specifying the true, i.e., "parent" population
from the observed values obtained by sampling is approached
easily by using the Principle of Maximum Likelihood, which has
already been noted on p. 39 (Chapter V). In order to give a
simple illustration it will be of assistance to consider the problem
of determining the parent population from which observed rates
of mortality are drawn. Suppose, therefore, that the true rate
of mortality at a particular age in the parent population is an
unknown quantity, q, of which the best obtainable estimate is
required so that the population may be specified thereby. Suppose,
also, that E' persons, drawn as a sample from that parent
population, have been observed, and that $\theta'$ of them have died,
so that the observed rate of mortality, q', is $\frac{\theta'}{E'}$. Then, as in
the derivation of (1) in Chapter III, the probability that
out of E' exposed to risk there will be $\theta'$ deaths, when the
true rate of mortality is q, is $\frac{E'!}{\theta'!\,(E'-\theta')!}\, q^{\theta'} (1-q)^{E'-\theta'}$. Clearly the
most probable hypothesis for the unknown q will be to assign to
it that value which will make this expression a maximum.
Taking logarithms, differentiating, and equating to zero, it is
seen at once that the maximum value will be given when $q = \frac{\theta'}{E'}$.
That is to say, the most probable estimate of the unknown q of
the parent population will be provided by the observed value
$q' = \frac{\theta'}{E'}$.
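The maximizing step can be illustrated numerically by evaluating the logarithm of the probability over a fine grid of trial values of q and locating the largest. The Python sketch below does this for a hypothetical experience of 1000 exposed to risk and 23 deaths; the figures and the grid spacing are assumptions made for illustration.

```python
import math

def log_likelihood(q, exposed, deaths):
    """Logarithm of q^deaths * (1-q)^(exposed-deaths).

    The binomial coefficient is omitted because it does not involve q
    and so does not affect the position of the maximum.
    """
    return deaths * math.log(q) + (exposed - deaths) * math.log(1 - q)

exposed, deaths = 1000, 23                       # hypothetical observed experience
grid = [i / 10000 for i in range(1, 10000)]      # trial values of q in steps of 0.0001
best_q = max(grid, key=lambda q: log_likelihood(q, exposed, deaths))

print(best_q, deaths / exposed)
```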



The preceding application of the principle of maximum
likelihood can be extended immediately to the specification of a
parent population which is characterized by more than one
unknown parameter, as in the case of the multinomial distribution
discussed in Chapter VI. For, as there, suppose the
parent population consists of N persons distributed into $\nu$ cells
with the unknown true relative frequencies $p_1, p_2, \ldots, p_\nu$; then,
if a sample of N' persons is drawn and is observed to contain
$f'_1, f'_2, \ldots, f'_\nu$ cases respectively attributable to the $\nu$ cells, what
values for the true probabilities $p_1, p_2, \ldots, p_\nu$ will give the
maximum likelihood to the sample drawn, i.e., what is the best
estimate that can be made for those probabilities, by which the
parent population can be specified? As in (48) and (49) the
probability of drawing the particular sample observed is

$$\frac{N'!}{f'_1!\, f'_2! \ldots f'_\nu!}\; p_1^{f'_1}\, p_2^{f'_2} \ldots p_\nu^{f'_\nu} = P_s \text{ say} \qquad \ldots(i)$$

in which

$$p_1 + p_2 + \ldots + p_\nu = 1 \qquad \ldots(ii)$$

and

$$f'_1 + f'_2 + \ldots + f'_\nu = N' \qquad \ldots(iii)$$

Now $\log P_s$ will be a maximum when $P_s$ is a maximum; we
therefore take the logarithm of (i), and differentiating and equating
to zero find

$$\frac{f'_1}{p_1}\,\delta p_1 + \frac{f'_2}{p_2}\,\delta p_2 + \ldots + \frac{f'_\nu}{p_\nu}\,\delta p_\nu = 0 \qquad \ldots(iv)$$

where for the variations $\delta p_1, \ldots, \delta p_\nu$ in $p_1, \ldots, p_\nu$ we have, from
(ii), the condition

$$\delta p_1 + \delta p_2 + \ldots + \delta p_\nu = 0 \qquad \ldots(v)$$

From (iv) and (v) evidently $\frac{f'_1}{p_1} = \frac{f'_2}{p_2} = \ldots = \frac{f'_\nu}{p_\nu} = a$, say.
Consequently $f'_1 = ap_1$; $f'_2 = ap_2$; ...; $f'_\nu = ap_\nu$, whence

$$a = \frac{f'_1}{p_1} = \ldots = \frac{f'_\nu}{p_\nu} = \frac{f'_1 + \ldots + f'_\nu}{p_1 + \ldots + p_\nu}, \text{ which by (ii) and (iii)} = N'.$$

Finally, therefore, it follows that $p_1 = \frac{f'_1}{N'}$; $p_2 = \frac{f'_2}{N'}$; ...; $p_\nu = \frac{f'_\nu}{N'}$.

That is to say, the assignment of the maximum value to the
probability of drawing the sample actually observed means that
the most probable estimate of the parent probabilities, $p_r$, will be
provided by using the observed relative frequencies $\frac{f'_r}{N'}$ ($= p'_r$ say).


B ; 22. The Classical Method of Approximation and Correction 
for the Least Squares Fitting of Transcendental 
Equations 

If the curve to be fitted is $y''_x = f''(x; \alpha, \beta, \ldots)$, and approximate
values $\alpha', \beta', \ldots$ have been found for the unknowns $\alpha, \beta,
\ldots$, so that $\alpha = \alpha'+\delta\alpha$, $\beta = \beta'+\delta\beta$, ..., the problem becomes one
of fitting the curve $y''_x = f''(x; \alpha'+\delta\alpha, \beta'+\delta\beta, \ldots)$, and hence of
determining the corrections $\delta\alpha, \delta\beta, \ldots$, which therefore now are
to be considered as the unknowns. Assuming that the approximate
values $\alpha', \beta', \ldots$ are sufficiently close to $\alpha, \beta, \ldots$ that the
squares and higher powers of $\delta\alpha, \delta\beta, \ldots$ may be neglected, $y''_x$ can
be expressed by Taylor's Theorem in the form

$$f''(x; \alpha', \beta', \ldots) + \delta\alpha\left[\frac{\partial}{\partial\alpha'} f''(x; \alpha', \beta', \ldots)\right] + \delta\beta\left[\frac{\partial}{\partial\beta'} f''(x; \alpha', \beta', \ldots)\right] + \ldots$$

Since this is to be fitted to the observed series of $f'_x$'s, and the
numerical values of $f''(x; \alpha', \beta', \ldots)$ are available by computation,
while those of $\frac{\partial}{\partial\alpha'} f''(x; \alpha', \beta', \ldots)$, etc., are also deducible
by differentiation and calculation, the problem becomes
simply one of fitting the linear function $k_\alpha(\delta\alpha) + k_\beta(\delta\beta) + \ldots$ to
the numerical values of $[f'_x - f''(x; \alpha', \beta', \ldots)]$, where $k_\alpha, k_\beta, \ldots$
are constants of which the numerical values are known, and
$\delta\alpha, \delta\beta, \ldots$ are the unknowns sought. The method of least
squares can therefore be applied directly by the formation of linear
"normal" equations in the form already stated and with the
weights being employed in the manner previously given (cf.
P:51 :121 and H 44 *169, in both of which, however, the definition
and use of the weights in the formation of the normal equations
must be watched carefully to avoid confusion; see p. 322; C; 20
here).
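The method of approximation and correction can be sketched in a few lines of code. The Python below applies it to the hypothetical curve $f''(x; \alpha, \beta) = \alpha e^{\beta x}$, forming the linear normal equations for the corrections $\delta\alpha$ and $\delta\beta$ and repeating the step with the corrected values. The choice of curve, the data, the starting values, and the use of equal weights are all assumptions made for illustration, not taken from the text.

```python
import math

def correction_step(xs, observed, alpha, beta):
    """One linearized least squares step for the curve alpha * exp(beta * x).

    Solves the two normal equations for the corrections (equal weights assumed)
    and returns the corrected values of alpha and beta.
    """
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, obs in zip(xs, observed):
        k_alpha = math.exp(beta * x)               # derivative with respect to alpha
        k_beta = alpha * x * math.exp(beta * x)    # derivative with respect to beta
        resid = obs - alpha * math.exp(beta * x)   # observed minus approximate curve
        a11 += k_alpha * k_alpha
        a12 += k_alpha * k_beta
        a22 += k_beta * k_beta
        g1 += k_alpha * resid
        g2 += k_beta * resid
    det = a11 * a22 - a12 * a12
    d_alpha = (g1 * a22 - a12 * g2) / det
    d_beta = (a11 * g2 - a12 * g1) / det
    return alpha + d_alpha, beta + d_beta

# Hypothetical data generated exactly from alpha = 2, beta = 0.3.
xs = [0, 1, 2, 3, 4, 5]
observed = [2.0 * math.exp(0.3 * x) for x in xs]

alpha, beta = 1.8, 0.35          # approximate starting values alpha', beta'
for _ in range(8):               # repeat the correction with the improved values
    alpha, beta = correction_step(xs, observed, alpha, beta)
print(alpha, beta)
```

Repeating the correction is a natural extension of the single step described above; each pass treats the latest corrected values as the new approximations.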


B ; 23. The Approximate Equivalence of the Weighted Equa- 
tions of Least Squares and the Unweighted Equations 
of Moments in the Case of an Exponential Function 
Representing a Frequency Distribution 


In Chapter VIII the formulae, (111) to (113), were given for
the weights of several ratios, such as $W\{q'_x\}$, etc., which are
required in the fitting of any curve to such ratios by the method
of least squares, these formulae being obtained from the mathematical
model of $E'_x$ exposed to risk at age x being subject to an
observed rate of mortality $q'_x$, where $q'_x = \frac{\theta'_x}{E'_x}$ and the
corresponding true parent probabilities are $q_x$ and $p_x$. Such ratios
by age are not, of course, frequency distributions; they are simply
curves depicting the progression of the ratios from age to age.
The observed deaths, however, do form a frequency distribution.
The weight, $W\{\theta'_x\}$, of each term, by (114), is then
$\frac{1}{E'_x p_x q_x}$; at most ages, as in (116), this may often be taken as
approximately $\frac{1}{E'_x q_x}$, which again will be given roughly by $\frac{1}{E'_x q''_x}$,
that is, by the reciprocal of the graduated values of the frequency
distribution. It therefore follows that in the fitting of a curve
$f''_x$ to a frequency distribution, the weights may be taken as
approximately $\frac{1}{f''_x}$ (cf. P:fS7:368, and P:5I:129, remembering
that in the latter $\sqrt{W_x} = \frac{1}{\sqrt{f''_x}}$ is used, as noted at p. 322;
C; 20 here).


Now the exponential function $f''_x = e^{a_0+a_1x+a_2x^2+\ldots}$ can
represent approximately most frequency distributions. But here
$\frac{\partial}{\partial a_n}(f''_x) = x^n f''_x$; and consequently, since $W_x \doteq \frac{1}{f''_x}$ as above, the
weighted normal equations (116) of least squares become approximately
$\sum\left[\frac{1}{f''_x}(f'_x - f''_x)(x^n f''_x)\right] = 0$, that is, $\sum\left[x^n(f'_x - f''_x)\right] = 0$ for
n = 0, 1, 2, ..., which are the unweighted equations (119) of the
method of moments.


We consequently see that in the case of a frequency distribution,
for which the weights can be taken as approximately $\frac{1}{f''_x}$, and
which can generally be represented roughly by an exponential
$f''_x = e^{a_0+a_1x+a_2x^2+\ldots}$, the weighted equations of least squares
may be expected to lead to practically the same results as the
unweighted equations of the method of moments.


[It should be noted in the preceding demonstration that the
approximate weights $\frac{1}{f''_x}$ involve, of course, the unknown parameters
of which the values are being sought, but that their introduction
into the argument at the stage of the differential form
(116) results, through the subsequent cancellation of $f''_x$, in the
elimination of the difficulty which would arise in the differentiations
if $\frac{1}{f''_x}$ were inserted for $W_x$ in (109). For the weights
$W_x$ in (109), being dependent only on the process of selection by
which the observed $f'_x$'s are derived from the parent $f_x$'s, are
independent of the unknowns which are involved in the $f''_x$'s; they
are therefore to be treated as constants in the differentiations of
(109), and consequently so appear in (116). But if in (109) $W_x$
could be taken as $\frac{1}{f''_x}$, it would evidently involve the unknowns,
and the differentiations of (109) would immediately be complicated.
The insertion of $\frac{1}{f''_x}$ as an approximation to $W_x$ in the
derived normal equations (116), however, will evidently lead to
close results and is sufficiently justifiable under all the conditions
assumed.]


B; 24. The Relation between the Criteria of Weighted Least
Squares and Minimum-$\chi^2$ in the Graduation of Mortality
Tables


As shown by (109), but dropping the limits of summation
for convenience here, the method of least squares imposes the
requirement that $\sum\left[W_x(f'_x - f''_x)^2\right]$ shall be a minimum, where $f'_x$
are the observed values, $f''_x$ the fitted (i.e., graduated) values, and
the weights $W_x$ are based on the true (i.e., parent) values and are
thus independent of $f'_x$ and also of the graduated values, $f''_x$, which
are being sought.

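By this method, therefore, a graduation of a series of
observed deaths, $\theta'_x$, by means of a fitted series $\theta''_x$, requires that
$\sum\left[W\{\theta'_x\}(\theta'_x - \theta''_x)^2\right]$ be minimized; and since by (114) the weight
$W\{\theta'_x\}$ is $\frac{1}{E'_x p_x q_x}$, we see at once that the condition here imposed
by the method of least squares is that $\sum\left[\frac{(\theta'_x - \theta''_x)^2}{E'_x p_x q_x}\right]$ must
be a minimum.

Precisely the same result is reached, moreover, in a weighted
least squares graduation of a series of observed rates of mortality,
$q'_x \left(= \frac{\theta'_x}{E'_x}\right)$, where the graduated rates, $q''_x$, are applied to the
observed $E'_x$ in order to give a series of adjusted deaths $\theta''_x = E'_x q''_x$.
For then the least squares expression is $\sum\left[W\{q'_x\}(q'_x - q''_x)^2\right]$; and
since by (111) the weight $W\{q'_x\}$ is $\frac{E'_x}{p_x q_x}$, this expression to be
minimized becomes $\sum\left[\frac{E'_x (q'_x - q''_x)^2}{p_x q_x}\right]$, which is here to be taken
as $\sum\left[\frac{(E'_x q'_x - E'_x q''_x)^2}{E'_x p_x q_x}\right] = \sum\left[\frac{(\theta'_x - \theta''_x)^2}{E'_x p_x q_x}\right]$, the same
condition as before.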
In the case of the minimum-$\chi^2$ method the criterion, as
explained in the statement of (121), is that
$\Sigma\left[\dfrac{(f'_x - f''_x)^2}{f''_x}\right]$ shall be a
minimum. Being obtained directly from the Multinomial Law
of Deviations, which is based on a mathematical model of the
various values of the observed series, $f'_x$, falling into cells so that
the whole series is completely distributed, the minimum-$\chi^2$
method is evidently applicable only to cases which can be viewed
as frequency distributions. For the graduation of mortality
data, therefore, we can contemplate either (i) a direct graduation
of the observed deaths, $\theta'_x$, which can be visualized as a frequency
distribution, or (ii) a graduation of the rates of mortality, $q'_x$, in
such a form that we are in fact dealing with a frequency
distribution.

In (i), a direct graduation of $\theta'_x$, the criterion (121) becomes
simply that $\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{\theta''_x}\right]$ must be a minimum. This
interpretation of the $\chi^2$ method is illustrated by the calculation of $\chi^2$ for
Weldon’s dice data at p. 334; C; 25, and by the analogous example
of a simple distribution of deaths which is also given at p. 339
thereof. The minimum-$\chi^2$ method of Cramer and Wold, moreover,
follows this approach, for their minimum-$\chi^2$ graduation
process is based on the principle of using (see P:^S:172)

$$\Sigma\left[\frac{(\theta'_x - \theta''_x)^2}{\theta'_x}\right], \quad \text{where } \theta''_x = E_x q''_x.$$

In (ii), the graduation of rates of mortality, $q'_x$, but in such a
manner that we are in fact dealing with a frequency distribution of
deaths, it is to be remembered that at each age the observed
data are $E_x q'_x = \theta'_x$ who die, and $E_x p'_x$ $[= E_x(1 - q'_x) = E_x - \theta'_x]$ who do
not die, and that the graduation finds for them the fitted values
$E_x q''_x = \theta''_x$ who die, and $E_x p''_x$ $[= E_x(1 - q''_x) = E_x - \theta''_x]$ who do not
die. At each age, therefore, the contribution to $\chi^2$ will be

$$\frac{(\theta'_x - \theta''_x)^2}{\theta''_x} + \frac{[(E_x - \theta'_x) - (E_x - \theta''_x)]^2}{E_x - \theta''_x},$$

as in the binomial verification of the Multinomial Law (50). This
expression is

$$(\theta'_x - \theta''_x)^2\left[\frac{1}{E_x q''_x} + \frac{1}{E_x p''_x}\right] = \frac{(\theta'_x - \theta''_x)^2}{E_x p''_x q''_x},$$

since $p''_x + q''_x = 1$. The minimum-$\chi^2$ principle in this case therefore
requires that $\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{E_x p''_x q''_x}\right]$ shall be a minimum. A numerical
example of the calculation and use of this expression as the
$\chi^2$ test of goodness of fit is given at p. 341; C; 25.
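
The reduction of the two-cell contribution (those who die and those who do not) to the single term with $E_x p''_x q''_x$ in the denominator can be checked with exact arithmetic. The following is only an illustrative sketch: the function names and the figures (1,000 exposed, 27 observed and 25 fitted deaths) are hypothetical and are not taken from the text.

```python
from fractions import Fraction


def chi2_two_cell(exposed, deaths_obs, deaths_fit):
    """Contribution of one age to chi-square, summed over the two cells
    (those who die, those who do not die)."""
    e, obs, fit = Fraction(exposed), Fraction(deaths_obs), Fraction(deaths_fit)
    died = (obs - fit) ** 2 / fit
    survived = ((e - obs) - (e - fit)) ** 2 / (e - fit)
    return died + survived


def chi2_single_term(exposed, deaths_obs, deaths_fit):
    """The same contribution written as (obs - fit)^2 / (E * p'' * q'')."""
    e, obs, fit = Fraction(exposed), Fraction(deaths_obs), Fraction(deaths_fit)
    q_fit = fit / e          # graduated rate of mortality
    p_fit = 1 - q_fit        # graduated rate of survivorship
    return (obs - fit) ** 2 / (e * p_fit * q_fit)


# Hypothetical figures: 1,000 exposed, 27 observed deaths, 25 fitted deaths.
two_cell = chi2_two_cell(1000, 27, 25)
single = chi2_single_term(1000, 27, 25)
print(two_cell, single)
```

Because `Fraction` is exact, the two forms agree identically and not merely to rounding.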

From the preceding analyses it will be seen that the weighted
least squares graduation of either the deaths, $\theta'_x$, or the rates of
mortality, $q'_x$, imposes the condition that
$\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{E_x p_x q_x}\right]$ shall be a
minimum; the minimum-$\chi^2$ method, on the other hand, requires
that $\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{\theta''_x}\right] = \Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{E_x q''_x}\right]$ shall be a minimum for a
graduation of $\theta'_x$, and that $\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{E_x p''_x q''_x}\right]$ shall be a minimum in a
graduation of $q'_x$. In the least squares method the true values
$p_x$ and $q_x$ must in practice be estimated by an approximate
preliminary graduation (see p. 96); if these estimated values are
reasonably close to $p''_x$ and $q''_x$ (as they should be by any proper
method of determining them), the result of the weighted least
squares process will evidently be very similar to that of minimizing
$\Sigma\left[\dfrac{(\theta'_x - \theta''_x)^2}{E_x p''_x q''_x}\right]$, which is the theoretical criterion for
minimum-$\chi^2$ in a graduation of $q'_x$. It is, however (cf. p. 102),
difficult to minimize such an expression, for it involves the $p''_x$ and
$q''_x$, which are being sought, in the denominator; a practical
device, therefore (as Cramer and Wold propose in their
minimum-$\chi^2$ graduation of $\theta'_x$), would be to substitute the observed
$p'_x$ and $q'_x$, or to determine (as for least squares) approximate
values for $p_x$ and $q_x$ by a preliminary graduation. On either of
these approximate bases it is clear that the results by the weighted
least squares and the minimum-$\chi^2$ conditions will be very similar.


B; 25. The Frequency of Changes of Sign 

Since the fact does not appear to be generally known, it may 
be of interest to record here that the first discussion and use of a 
method for examining the changes of sign in a mortality table
graduation appears to have been given in 1876-1878 by De Forest,
whose pioneer work on graduation by linear compounding con-
stitutes one of the most remarkable accomplishments to be found
in actuarial literature (see p. 284; C; 7, section (xii), and V:166).
His first consideration of the problem was in H:49:29-35, on the
following lines.

(a) In a periodic series (such as a sine curve), for which the
first and last terms are consecutive, the probabilities that any
particular 2, 3, 4, . . . signs for the deviations between the
observed and graduated series will be alike are evidently
$\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$, on the assumption that positive and negative
deviations are equally likely to occur. The probability that any
group of r consecutive signs will be the same is therefore
$\dfrac{1}{2^{r-1}}$. Furthermore, the probability that any group of like signs
will be isolated, so that the signs next preceding and following the
group will be different, will be $(\tfrac{1}{2})(\tfrac{1}{2})$. The probability that any
particular group of r signs will be alike and isolated is therefore
$\dfrac{1}{2^{r+1}}$. If N be the total number of signs in the periodic series,
the expected number of isolated groups of r like signs is
consequently $\dfrac{N}{2^{r+1}}$. Writing this as $Np$, where $p = \dfrac{1}{2^{r+1}}$, we see
also, by formula (8), that the standard deviation of this number
would be $\sqrt{Npq}$ on the assumption that the occurrences of the
signs are independent (which, however, is open to question); and
if, with De Forest (following the practice of his time), the
“probable error” be adopted to indicate the limits of variation as
in (21), we might say that in a periodic series of N terms the
expected number of isolated groups of r like signs can be
computed, with its probable error, as

$$\frac{N}{2^{r+1}} \pm \frac{.6745}{2^{r+1}}\sqrt{N(2^{r+1} - 1)}.$$


The expression $\dfrac{N}{2^{r+1}}$ may also be reached clearly as follows
(see Hilld :146): If in a series of numbers the conditions of
selection are such that odd and even numbers are equally likely a
priori, then the probability of any particular number being, for
example, an even number is $\tfrac{1}{2}$. An isolated sequence of exactly
r even numbers (i.e., the non-occurrence of change from even to
odd) will occur through the appearance of r even numbers
followed by an odd number (the appearance of the odd number
being essential to terminate the sequence of the even numbers).
The probability of a sequence of r even numbers is therefore
$\left(\tfrac{1}{2}\right)^r\left(\tfrac{1}{2}\right) = \dfrac{1}{2^{r+1}}$, and in a total of N numbers the expected number
of sequences of r even numbers is $\dfrac{N}{2^{r+1}}$, as before. Visualizing
now the occurrence of + and − signs in a series of differences
between ungraduated and graduated rates of mortality, it follows
that the infrequency of changes of sign to be anticipated as a
result of a merely chance distribution of such changes (as might
occur in an ideal graduation) may be measured by taking
$\dfrac{N}{2^{r+1}}$ as the number of times a sequence of r plus or r minus
signs may be expected to occur.
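
The periodic-series result can be checked by brute force. The sketch below is illustrative only: it assumes, as the text does, that the signs are independent and equally likely to be + or −, and it enumerates every possible circular arrangement of N signs. The function names are hypothetical and are not De Forest's. For a circular series the count $N/2^{r+1}$ applies when r does not exceed N − 2, since a group needs a distinct sign on each side.

```python
from fractions import Fraction
from itertools import product
import math


def circular_run_lengths(signs):
    """Lengths of the groups of like signs when first and last are consecutive."""
    n = len(signs)
    # Find a place where the sign changes (signs[-1] precedes signs[0] cyclically).
    start = next((i for i in range(n) if signs[i] != signs[i - 1]), None)
    if start is None:          # all signs alike: one unbroken group
        return [n]
    rotated = signs[start:] + signs[:start]
    runs, length = [], 1
    for prev, cur in zip(rotated, rotated[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return runs


def expected_isolated_groups(n, r):
    """Exact expected number of isolated groups of r like signs in a
    periodic series of n equally likely, independent signs."""
    total = sum(circular_run_lengths(s).count(r)
                for s in product((+1, -1), repeat=n))
    return Fraction(total, 2 ** n)


def probable_error(n, r):
    """.6745 * sqrt(Npq) with p = 1 / 2^(r+1), as in the text."""
    return 0.6745 * math.sqrt(n * (2 ** (r + 1) - 1)) / 2 ** (r + 1)


for r in range(1, 7):
    print(r, expected_isolated_groups(8, r), Fraction(8, 2 ** (r + 1)))
```

Each printed pair agrees exactly; the probable error, by contrast, rests on the independence assumption which the text itself questions.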

(b) De Forest also observed (H:51:111) that since with r = 1
the expected number of sequences is $\dfrac{N}{4}$, and when r = 2 it is
$\dfrac{N}{8}$, so that the number of signs which occur singly or two alike is
$\dfrac{N}{4} + 2\left(\dfrac{N}{8}\right) = \dfrac{N}{2}$, “we have this practical rule, that if a series
has been well adjusted, the whole number of signs . . . which fall
within groups of only one or two like signs each will probably be
about equal to the whole number which fall within groups of
more than two”.

(c) From the preceding it follows that the average number of
sequences of all orders is obtainable by summing $\dfrac{N}{2^{r+1}}$ for all
possible values r = 1, 2, 3, . . . . The limit, as N approaches ∞,
becomes $\dfrac{N}{4}\left[1 + \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^2 + \ldots\right] = \dfrac{N}{2}$.

For a non-periodic series, such as rates of mortality, De Forest
pointed out (in H:51:111) that the preceding methods based on
$\dfrac{N}{2^{r+1}}$, which strictly suppose that the series is periodic, will also
give a close enough approximation provided that the first and
last signs are treated as consecutive so that they will be considered
as belonging to the same group if they are alike.

In order, however, to avoid that approximate method of
dealing with a non-periodic series, De Forest also suggested
(H:49:32) that the first and last signs be omitted, for they
cannot strictly be included within any group, because they are
not consecutive, and it is not known whether or not they are
isolated (the next sign beyond being unknown). If we thus omit
the first and last signs in a non-periodic series of N terms, it will
be clear that the number of possible groups of r consecutive signs
will be N − r − 1 (for example, in 5 terms there is one middle
group of 3; in 7 terms there is one middle group of 5, 2 groups of
4, 3 groups of 3, etc.). The expected number of isolated groups
of r like signs is therefore $\dfrac{N - r - 1}{2^{r+1}}$, since the probability of any
particular group of r signs being alike and isolated has already
been established as $\dfrac{1}{2^{r+1}}$, with a corresponding probable error
(if desired, and subject to the validity of assuming that the
occurrences are independent) of
$\pm \dfrac{.6745}{2^{r+1}}\sqrt{(N - r - 1)(2^{r+1} - 1)}$.


Furthermore, if the complete non-periodic series is to be
examined, we must add to the preceding $\dfrac{N - r - 1}{2^{r+1}}$ the two terms
involving the first and last signs. Starting with the first we note
that the probability of its being positive is $\tfrac{1}{2}$; the probability then
of r − 1 more positive signs (to make the sequence of r alike) is
$\dfrac{1}{2^{r-1}}$; and the probability of the next sign being negative (to
terminate the sequence of r) is $\tfrac{1}{2}$; consequently the probability of r
isolated positive signs starting with the first is
$\dfrac{1}{2}\left(\dfrac{1}{2^{r-1}}\right)\dfrac{1}{2} = \dfrac{1}{2^{r+1}}$.
The probability of r isolated negative signs starting with
the first is similarly $\dfrac{1}{2^{r+1}}$. Consequently the total probability
of r isolated positive or negative (i.e., like) signs starting with the
first is $\dfrac{1}{2^r}$. By exactly the same reasoning it
is evident that the probability of r isolated positive or negative
(i.e., like) signs ending with the last is also $\dfrac{1}{2^r}$. The total
expected number of isolated groups of r like (positive or negative)
signs over the whole range of a non-periodic series of N terms
(including the first and last) is therefore

$$\frac{N - r - 1}{2^{r+1}} + \frac{1}{2^r} + \frac{1}{2^r} = \frac{N - r + 3}{2^{r+1}},$$

as given by Seal in P:/;^5:29.

More elaborate discussions of these principles are also avail- 
able in De Forest’s papers and H:55:71. 
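
For the non-periodic case the total $(N - r + 3)/2^{r+1}$, made up of the interior groups and the two end terms, can likewise be verified by enumeration. This is a sketch under the same assumption of independent, equally likely signs; the function name is hypothetical. The check is run for r from 1 to N − 1, since a single group filling the whole series has no terminating sign at either end.

```python
from fractions import Fraction
from itertools import groupby, product


def expected_groups_non_periodic(n, r):
    """Exact expected number of groups of exactly r like signs in a
    non-periodic series of n independent, equally likely signs."""
    total = 0
    for signs in product("+-", repeat=n):
        # groupby splits the series into maximal groups of like signs
        total += sum(1 for _, group in groupby(signs) if len(list(group)) == r)
    return Fraction(total, 2 ** n)


n = 8
for r in range(1, n):
    print(r, expected_groups_non_periodic(n, r), Fraction(n - r + 3, 2 ** (r + 1)))
```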


B; 26. The Most Probable Value of the Mean Square Error 
of an Observation when there are v Observation 
Equations, and k Unknowns have been Determined 
by the Method of Least Squares 

The method of least squares is based on the assumption (see
Chapter VIII, and p. 322; C; 20) that the observed values, $f'_r$, are
drawn as a sample of the true values, $f_r$, and that the determination
of the unknowns by the solution of the “normal equations”
will give the best fitted values, $f''_r$. The true errors are thus
$f'_r - f_r$; but after the fitting has been completed the residuals,
$f'_r - f''_r = v_r$ say, will still remain. Furthermore, it will be
remembered that when the observations, $f'_r$, are not all equally well
drawn, the residuals become invested with “weights”, $W_r = \dfrac{1}{\sigma_r^2}$,
so that the whole process actually makes the sum of all the values
of $W_r v_r^2$, or $(\sqrt{W_r}\,v_r)^2$, a minimum.

In order to simplify the proof to be now given (for which
cf. P:IS:100), let us write $V_r$ for $\sqrt{W_r}\,v_r$; that is, in the usual
language of the text-books, let us suppose that “each observation
equation [$f'_r - f''_r = 0$, being $v_r = 0$] has been multiplied by the
square root of its weight [$\sqrt{W_r}$], so that the residuals are all
reduced to unit weight [$V_r$]”. Then, on the assumption of the
Normal Curve, the a priori probability of the whole series of
reduced deviations $v_r\sqrt{W_r}$, or $V_r$, is
$\left(\dfrac{1}{c\sqrt{\pi}}\right)^{v} e^{-\Sigma V^2/c^2}$ by (104),
where the parameter c remains to be found.

Now the object of the fitting process is to determine the
unknowns in the expression to be fitted, which we may suppose
is $f''_r = f(x; \alpha, \beta, \gamma, \ldots)$, so that the unknowns are $\alpha, \beta, \gamma, \ldots$,
numbering, say, k. Since these unknowns are independent of
each other, and might (if we adopt the Principle of Insufficient
Reason; see p. 38, and p. 181; (f) of B; 1, and p. 222; B; 12)
each have any values whatever between $-\infty$ and $+\infty$, the total
probability of the system of residuals under consideration is

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}\left(\frac{1}{c\sqrt{\pi}}\right)^{v} e^{-\Sigma V^2/c^2}\,d\alpha\,d\beta\,d\gamma\ldots, \qquad \text{(i)}$$

where there are k integrations.


In performing the integration with respect to the first
unknown, $\alpha$, we may express the terms in $\Sigma V^2$ which involve $\alpha$ in
the form $(A\alpha + B)^2$. Then

$$\int_{-\infty}^{+\infty} e^{-(A\alpha + B)^2/c^2}\,d\alpha = \frac{c\sqrt{\pi}}{A},$$

by putting $\dfrac{A\alpha + B}{c} = t$ and using (6) of B; 7. The probability (i)
with $\alpha$ thus eliminated becomes

$$\frac{c\sqrt{\pi}}{A}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}\left(\frac{1}{c\sqrt{\pi}}\right)^{v} e^{-\Sigma V_1^2/c^2}\,d\beta\,d\gamma\ldots,$$

where now k − 1 integrations remain, and $\Sigma V_1^2$ is the quadratic
function of $\beta, \gamma, \ldots$ which follows from $\Sigma V^2$ when $\alpha$ is chosen, by
the method of least squares, to make $\Sigma V^2$ a minimum. Repeating
the same argument for each of the k integrations we see
that the final probability reduces to
$K\left(\dfrac{1}{c}\right)^{v-k} e^{-\Sigma V_0^2/c^2}$, where K
is a constant (absorbing the factor in the denominator), and
$\Sigma V_0^2$ is the value of $\Sigma V^2$ when all the k unknowns $\alpha, \beta, \gamma, \ldots$ are
determined by least squares.

To find now the value of c for which this probability is a
maximum, we take logarithms, differentiate with respect to c,
and obtain $c^2 = \dfrac{2\,\Sigma V_0^2}{v - k}$, or $\dfrac{c^2}{2} = \dfrac{\Sigma V_0^2}{v - k}$. That is to say, the mean square
error of an observation of unit weight when there are v observation
equations and k unknowns is

$$\frac{\sum_{r=1}^{v} W_r (f'_r - f''_r)^2}{v - k}, \qquad \text{(ii)}$$

where the unknowns in $f''_r$ have been determined by the method
of least squares.
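
The last step, and the use of formula (ii), can be illustrated numerically. The sketch below is not part of the original demonstration: the function names and the small data set are hypothetical, and the constant K is omitted from the logarithm since it does not affect the position of the maximum.

```python
import math


def mean_square_error_unit_weight(weights, observed, fitted, k):
    """Formula (ii): sum of W_r (f'_r - f''_r)^2 divided by (v - k),
    where v is the number of observation equations and k the number
    of unknowns determined by least squares."""
    v = len(observed)
    weighted_sum = sum(w * (o - f) ** 2
                       for w, o, f in zip(weights, observed, fitted))
    return weighted_sum / (v - k)


def log_probability(c, sum_v0_squared, v, k):
    """log of (1/c)^(v-k) * exp(-sum V0^2 / c^2), omitting log K."""
    return -(v - k) * math.log(c) - sum_v0_squared / c ** 2


# Hypothetical case: v = 12 observation equations, k = 2 unknowns, sum V0^2 = 10.
s, v, k = 10.0, 12, 2
c_hat = math.sqrt(2 * s / (v - k))      # the value of c found by differentiation
print(c_hat ** 2 / 2, s / (v - k))      # both equal the mean square error

# Hypothetical observed and fitted values with unit weights and k = 2.
print(mean_square_error_unit_weight([1, 1, 1, 1], [1, 2, 3, 5], [1, 2, 3, 4], 2))
```

Evaluating `log_probability` slightly on either side of `c_hat` gives smaller values, which is the maximum property used in the text.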

The preceding proof is simply an extension, to the case of k
unknowns, of the demonstration of Bessel’s formula (42), namely,
$\sigma^2 = \dfrac{\Sigma v^2}{n - 1}$ for the case of one unknown
and uniform weights, as is shown in the proof of the latter
form on the basis of the Principle of Insufficient Reason at p. 38.
The proof just given, and the resulting formula, are consequently
open to the criticisms which may be levelled against this principle
(see p. 39 here, and the references there noted).

The alternative method of proof for Bessel’s formula (42)
which is given on p. 36 may likewise be extended to the case of k
unknowns, as shown in P:50:80-82 (see also P:165:205 and 243-5).

The mean square error of “an observation”, as it is contem-
plated in this formula (ii), is evidently the “most probable” or
“presumptive” value (see p. 219; B; 11) of the mean square error
of a fitted value which has been determined by the method of
least squares. It is not the mean square error of any particular
term of the fitted series; it is a hypothetical value, typifying the
accuracy of the graduation as a whole, to which a “most probable”
value has been assigned by maximizing the probability of
actually obtaining the system of residuals resulting from the
determination of the k unknowns from the v observations by the
method of least squares (cf. P:1^4*145, and P:P0:137).


B; 27. Moments 

For the point binomial $(q+p)^n$ it was shown, in the derivation
of formula (4), that the mean is $np$, and in formula (5) that the
average of the expected squares of the number of happenings is
$np(np+q)$. The latter was obtained as the sum of the second
powers of the variable each multiplied by the frequency; the
former as the sum of the first powers multiplied by the frequencies.
Consistently with these we could also take the sum of the
0th powers of the variable multiplied by the frequencies, namely,
$q^n(0)^0 + nq^{n-1}p(1)^0 + \ldots + p^n(n)^0 = (q+p)^n = (1)^n = 1$, showing, of
course, that for the point binomial $(q+p)^n$, which represents a
distribution of probabilities, the “total frequency” is 1. These
are simply three cases, for r = 2, 1, and 0, of a process of summing
the rth powers of the variable multiplied by the appropriate
frequencies.

If, therefore, instead of a point binomial probability distribution
for which the total frequency must be 1, we have, in
general, N cases altogether with frequencies $f_1, f_2, \ldots, f_n$
corresponding to values $x_1, x_2, \ldots, x_n$ of the variable, then, as for the
above point binomial,

the total frequency, N, is $f_1(x_1)^0 + f_2(x_2)^0 + \ldots + f_n(x_n)^0 = f_1 + f_2 + \ldots + f_n = \sum_{t=1}^{n} f_t$;

the first powers give $f_1(x_1)^1 + f_2(x_2)^1 + \ldots + f_n(x_n)^1 = \sum_{t=1}^{n} f_t(x_t)^1$;

the second powers give $f_1(x_1)^2 + f_2(x_2)^2 + \ldots + f_n(x_n)^2 = \sum_{t=1}^{n} f_t(x_t)^2$;

and generally

the rth powers give $f_1(x_1)^r + f_2(x_2)^r + \ldots + f_n(x_n)^r = \sum_{t=1}^{n} f_t(x_t)^r$.

If we reduce the values to a unit of frequency, and define the
rth moment (per unit frequency) as the sum of the products of the
frequencies per unit and the rth powers of the variable, we have
Total Frequency (0th Moment) = N
and rth Moment $= \dfrac{1}{N}\sum_{t=1}^{n} f_t(x_t)^r = \mu'_r$ say.

These moments are calculated with reference to the
commencement of the range. It is often more convenient, however,
to calculate them with reference to the arithmetic mean, which,
it is to be noted, is the preceding first moment $\mu'_1$. Clearly it is
only necessary to measure the variable from $\mu'_1$ instead of from
the commencement of the range; the rth moment about the mean
is therefore

$$\frac{1}{N}\sum_{t=1}^{n} f_t(x_t - \mu'_1)^r = \mu_r \text{ say.} \qquad \text{(a)}$$

Similarly, as for the changing of the origin of coordinates, the rth
moment about any arbitrary origin X is $\dfrac{1}{N}\sum_{t=1}^{n} f_t(x_t - X)^r$.

Since the commencement of the range is, in relation to the
mean, an arbitrary measure, (a) may be used to express the
important relations between $\mu_r$, the rth moment about the mean,
and $\mu'_r$, the rth moment about any arbitrary origin. For, expanding
the binomial in (a), and putting r successively = 0, 1, 2, 3, 4,
. . . , we find immediately

$\mu_0 = \mu'_0 = 1$

$\mu_1 = 0$

$\mu_2 = \mu'_2 - (\mu'_1)^2$

$\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3$

$\mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6(\mu'_1)^2\mu'_2 - 3(\mu'_1)^4$

The relation between the second moments, for example, is
illustrated for the point binomial by formulae (4), (5), and (6). For
we have above $\mu'_2 = \mu_2 + (\mu'_1)^2$ (which, from the definitions, means
that the second moment about an arbitrary origin equals the
second moment about the arithmetic mean plus the square of
the arithmetic mean measured from the arbitrary origin); (4)
means that $\mu'_1 = np$, (5) that $\mu'_2 = np(np+q)$, and (6) that $\mu_2 = npq$,
which conform with $\mu'_2 = \mu_2 + (\mu'_1)^2$.
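
The relations between moments about the mean and moments about an arbitrary origin can be verified exactly on a point binomial. The following sketch is illustrative: the choice n = 10, p = 1/10 and the function names are assumptions made for the example only.

```python
from fractions import Fraction
from math import comb


def binomial_frequencies(n, p):
    """Terms of the point binomial (q + p)^n for 0, 1, ..., n happenings."""
    q = 1 - p
    return [comb(n, s) * p ** s * q ** (n - s) for s in range(n + 1)]


def moment(freqs, r, origin=0):
    """rth moment per unit frequency about the given origin; the variable
    takes the values 0, 1, 2, ... in the order of the frequencies."""
    total = sum(freqs)
    return sum(f * (x - origin) ** r for x, f in enumerate(freqs)) / total


n, p = 10, Fraction(1, 10)
q = 1 - p
freqs = binomial_frequencies(n, p)

# Moments about the commencement of the range (the arbitrary origin 0).
m1, m2, m3, m4 = (moment(freqs, r) for r in (1, 2, 3, 4))
# Moments about the arithmetic mean m1.
mu2, mu3, mu4 = (moment(freqs, r, origin=m1) for r in (2, 3, 4))

print(m1, m2, mu2)   # np, np(np + q), npq
```

With `Fraction` inputs every relation holds identically, so no tolerance is needed in the comparison.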
The same relation is used on p. 216; B; 9 in transferring the
mean square deviation $np_tq_t$ $(t = 1, 2, \ldots, j)$ measured from the
mean $np_t$, to $np$. For $np_tq_t$ is the second moment about the
mean $np_t$, and is to be transferred to the arbitrary origin $np$.
Hence $np_tq_t$ represents $\mu_2$; by the relation $\mu'_2 = \mu_2 + (\mu'_1)^2$ we must
add $(\mu'_1)^2$, which is the square of the mean measured from the
arbitrary origin $np$, or $(np_t - np)^2$, so that the second moment
(the mean square deviation) measured from $np$ instead of from
$np_t$ becomes $np_tq_t + (np_t - np)^2$.

In the preceding paragraphs it has been convenient to denote
by $\mu$ a moment about the mean, and by $\mu'$ a moment about any
arbitrary origin, in respect of a frequency distribution of
ordinates. The same notation and relations evidently may be used
in the case of a continuous curve. For if $y = f(x)$ represents the
equation of the curve extending from $x = h$ to $x = k$, so that the
area $= \int_h^k y\,dx$, then the average value of x (i.e., the mean $\mu'_1$) is
$\int_h^k yx\,dx \div \int_h^k y\,dx$; generally the rth moment $\mu'_r = \int_h^k yx^r\,dx \div \int_h^k y\,dx$;
correspondingly the rth moment about the mean is

$$\mu_r = \int_h^k y(x - \mu'_1)^r\,dx \div \int_h^k y\,dx;$$

and the relations between $\mu$ and $\mu'$ are those already given.

Adjustments to Moments

An adjustment, however, obviously may be necessary when,
as in the case of fitting a mathematical curve to a set of statistical
data, we have to establish the relations between the moments
obtained from the integration of a continuous curve on the one
hand, and, on the other, the moments derived from (a) the given
ordinates, or (b) areas, of the statistics themselves.

(a) In the case of a series of ordinates which evidently would
finally vanish with high contact with the x-axis (i.e.,
asymptotically) at both ends of the range, it will be clear from the
following formula that no adjustment would be required. But if the
series at either end begins abruptly, an adjustment should be
based on the appropriate quadrature formulae, with the assumption
that $\int_{-1/2}^{1/2} y\,dx = y_0$; for example, for n given ordinates
$y_0, y_1, \ldots, y_{n-1}$, and $\int_{-1/2}^{n-1/2} y\,dx$ is the corresponding area of the
curve, it is easy (see P:5;8:26-29) to establish the relation

$$\int_{-1/2}^{n-1/2} y\,dx = 1.1220486(y_0 + y_{n-1}) + .7588542(y_1 + y_{n-2}) + 1.1578125(y_2 + y_{n-3}) + .9612847(y_3 + y_{n-4}) + (y_4 + \ldots + y_{n-5}) \qquad \text{(i)}$$

and by the right side of this expression to compute the corrected
ordinates (for examples see P:3£:28 and 35).

(b) When the statistics are given in groups (i.e., in a system
of areas), and there is high contact at both ends, the same
principle leads readily (see P:5;^:30, or Fill 4 >93) to Sheppard’s
Corrections (H:75), by which $\mu_2$, $\mu_3$, and $\mu_4$ are first computed directly
from the data, and then

$\mu_2$ (corrected) $= \mu_2 - \dfrac{h^2}{12}$

$\mu_3$ (corrected) $= \mu_3$

and $\mu_4$ (corrected) $= \mu_4 - \dfrac{1}{2}h^2\mu_2 + \dfrac{7}{240}h^4$

where h is the width of the class interval. Numerical examples
are shown accessibly in P:f 77:160, F:33:30, and P:51:55.
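
As a small illustration of how the corrections are applied, the sketch below takes moments about the mean already computed from grouped data and returns the corrected values. The function name and the trial figures are hypothetical; the formulae are the standard Sheppard corrections for $\mu_2$, $\mu_3$, and $\mu_4$, and they presuppose high contact at both ends of the range.

```python
from fractions import Fraction


def sheppard_corrected(mu2, mu3, mu4, h):
    """Apply Sheppard's corrections to moments about the mean that were
    computed directly from grouped data with class interval h."""
    h = Fraction(h)
    mu2_corrected = mu2 - h ** 2 / 12
    mu3_corrected = mu3                                   # no correction needed
    mu4_corrected = mu4 - h ** 2 * mu2 / 2 + 7 * h ** 4 / 240
    return mu2_corrected, mu3_corrected, mu4_corrected


# Hypothetical uncorrected moments with unit class interval.
print(sheppard_corrected(Fraction(2), Fraction(1, 2), Fraction(10), 1))
```

Note that the correction to $\mu_4$ uses the uncorrected $\mu_2$.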

Instead of thus calculating from the grouped data and
subsequently introducing Sheppard’s corrections, Hardy has suggested
a method of estimating the central ordinates of each group as the
original numbers for each group less $\tfrac{1}{24}$ of their respective
second central differences, and thence computing the moments
directly without further adjustment. In the case of mortality
data this process has the considerable advantage of giving values
for the central ordinates which are useful in the calculation of the
force of mortality, as well as providing a simple means of
examining the nature of the curve from the differences of the logarithms
of the ordinates (see P:5 j?:57-59).

When there is not high contact at both ends, Sheppard’s
formulae are not applicable, and other methods must be followed.
The central ordinates can then be found as indicated in P:5;?:236,
and adjusted by (i) herein. More elaborate corrections suggested
by Pairman and Pearson, and by E. S. Martin, may also be
applied (see P:S;^:231 and 236).

Hardy's Summation Method of Computing Moments 

A very useful scheme for calculating moments by means of
successive summations was suggested by Sir G. F. Hardy (P:5i:59),
and has been widely adopted in actuarial work. The method
is explained clearly in schedule form in P:32:20 (see also P:51:59
and 124 for the mathematical relations between the summations
and the moments). Other examples which are shown in detail
may be found in P:136:3 and H:106:292 and 322. The abbreviated
notation $\Sigma$, $\Sigma^2$, $\Sigma^3$, etc., is frequently used to denote the
successive sums.


B; 28. Thiele’s Half-Invariants

The remarkable system of symmetrical functions named by
Thiele “half-invariants” (called by some writers “semi-invariants”
or “seminvariants”) sometimes provides an elegant
method of dealing with the sums of powers.

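If $s_i = u_1^i + u_2^i + \ldots + u_n^i = \Sigma u_c^i$, where $u_c$ $(c = 1, 2, 3, \ldots n)$
denotes an observed value, then the “half-invariants” $\lambda_1, \lambda_2, \lambda_3,$
. . . are defined by the identity (with respect to v)

$$s_0\,e^{\frac{\lambda_1 v}{1!} + \frac{\lambda_2 v^2}{2!} + \frac{\lambda_3 v^3}{3!} + \ldots} = s_0 + \frac{s_1 v}{1!} + \frac{s_2 v^2}{2!} + \frac{s_3 v^3}{3!} + \ldots = e^{u_1 v} + e^{u_2 v} + e^{u_3 v} + \ldots. \qquad \text{(a)}$$

Differentiating (a) with respect to v we obtain

$$s_0\,e^{\frac{\lambda_1 v}{1!} + \frac{\lambda_2 v^2}{2!} + \ldots}\left[\lambda_1 + \frac{\lambda_2 v}{1!} + \frac{\lambda_3 v^2}{2!} + \ldots\right] = s_1 + \frac{s_2 v}{1!} + \frac{s_3 v^2}{2!} + \ldots$$

or, substituting by (a) for the first term,

$$\left[s_0 + \frac{s_1 v}{1!} + \frac{s_2 v^2}{2!} + \ldots\right]\left[\lambda_1 + \frac{\lambda_2 v}{1!} + \frac{\lambda_3 v^2}{2!} + \ldots\right] = s_1 + \frac{s_2 v}{1!} + \frac{s_3 v^2}{2!} + \ldots$$

from which, by equating coefficients of powers of v,

$s_1 = \lambda_1 s_0$

$s_2 = \lambda_1 s_1 + \lambda_2 s_0$

$s_3 = \lambda_1 s_2 + 2\lambda_2 s_1 + \lambda_3 s_0$

$s_4 = \lambda_1 s_3 + 3\lambda_2 s_2 + 3\lambda_3 s_1 + \lambda_4 s_0$

whence

$\lambda_1 = \dfrac{s_1}{s_0}$

$\lambda_2 = \dfrac{s_2}{s_0} - \left(\dfrac{s_1}{s_0}\right)^2$

$\lambda_3 = \dfrac{s_3}{s_0} - 3\left(\dfrac{s_2}{s_0}\right)\left(\dfrac{s_1}{s_0}\right) + 2\left(\dfrac{s_1}{s_0}\right)^3$

$\lambda_4 = \dfrac{s_4}{s_0} - 4\left(\dfrac{s_3}{s_0}\right)\left(\dfrac{s_1}{s_0}\right) - 3\left(\dfrac{s_2}{s_0}\right)^2 + 12\left(\dfrac{s_2}{s_0}\right)\left(\dfrac{s_1}{s_0}\right)^2 - 6\left(\dfrac{s_1}{s_0}\right)^4$

by which the half-invariants are expressed in sums of powers.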
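
The half-invariants can be computed from the sums of powers by solving the recurrence relations step by step. The sketch below assumes the general form of the recurrence, in which the binomial coefficients 1; 1, 1; 1, 2, 1; 1, 3, 3, 1 appear on successive lines. The function name and the data are hypothetical.

```python
from fractions import Fraction
from math import comb


def half_invariants(values, order=4):
    """Half-invariants lambda_1 ... lambda_order from the sums of powers
    s_i = sum(u ** i), using
        s_{m+1} = sum_{j=0}^{m} C(m, j) * lambda_{j+1} * s_{m-j}."""
    s = [sum(Fraction(u) ** i for u in values) for i in range(order + 1)]
    lam = []                                   # lam[j] holds lambda_{j+1}
    for m in range(order):
        already_known = sum(comb(m, j) * lam[j] * s[m - j] for j in range(m))
        lam.append((s[m + 1] - already_known) / s[0])
    return lam


# Hypothetical observed values.
print(half_invariants([1, 2, 3, 4]))
```

For these values the first half-invariant is the mean and the second is the mean square deviation about the mean, in agreement with the explicit expressions in sums of powers.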

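The relation between the half-invariants and moments (see
p. 253; B; 27) is likewise

$$e^{\frac{\lambda_1 v}{1!} + \frac{\lambda_2 v^2}{2!} + \ldots} = 1 + \frac{\mu'_1 v}{1!} + \frac{\mu'_2 v^2}{2!} + \ldots, \text{ whence}$$

$\lambda_1 = \mu'_1$

$\lambda_2 = \mu'_2 - (\mu'_1)^2$

$\lambda_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3$

or, for moments about the mean,

$\lambda_1 = 0$

$\lambda_2 = \mu_2$

$\lambda_3 = \mu_3$

$\lambda_4 = \mu_4 - 3(\mu_2)^2$.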

The utility of these half-invariants in analysis is illustrated in 
numerous publications, such as P'.36, 
B; 29. The Gamma and Beta Functions

The Complete Gamma Function for any positive number, i.e.,
n > 0, is defined as

$$\Gamma(n) = \int_0^\infty e^{-x}x^{n-1}\,dx.$$

In general, $\Gamma(n+1) = n\Gamma(n)$; and $\Gamma(0) = \infty$.

If n is a positive integer, $\Gamma(n+1) = n!$; and $\Gamma(1) = 1$.

Also, $\Gamma\left(\dfrac{n}{2}\right) = \left(\dfrac{n}{2} - 1\right)!$ if n is even, and
$\dfrac{(n-2)(n-4)\ldots 1}{2^{(n-1)/2}}\sqrt{\pi}$
if n is odd; and $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

By the substitution $x = z^2$ it follows that

$\int_0^\infty e^{-z^2}z^m\,dz$ (where $m > -1$) $= \dfrac{1}{2}\Gamma\left(\dfrac{m+1}{2}\right)$.

The proofs may be found in P:^i:I, 250 and II, 323, or 
P:5^:237. 

The Incomplete Gamma Function is defined as

$$\Gamma_x(n+1) = \int_0^x e^{-x}x^n\,dx.$$

Tables of $\Gamma(n)$ or $\log\Gamma(n)$ are available in P:07, P:47, and
elsewhere, and conveniently for actuaries in P:^;^:266.

Values of the “incomplete” function have been tabulated in
P:98.

The Complete Beta Function of any two positive numbers,
m and n, is defined by

$$B(m, n) = \int_0^1 x^{m-1}(1-x)^{n-1}\,dx,$$

and the Incomplete Beta Function by

$$B_x(m, n) = \int_0^x x^{m-1}(1-x)^{n-1}\,dx.$$

As shown, for example, in F:21:II, 337 or P:5;^:237,

$$B(m, n) = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}.$$

Values of the functions have been tabulated in V\101,
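
Where printed tables are not at hand, the properties of the Gamma function stated in this note can be checked numerically. The sketch below is illustrative only. It uses `math.gamma` from the Python standard library and a simple midpoint quadrature for the Beta integral; it also assumes the standard relation $B(m, n) = \Gamma(m)\Gamma(n)/\Gamma(m+n)$. The function names and the trial arguments are hypothetical.

```python
import math


def beta_by_quadrature(m, n, steps=20000):
    """Complete Beta function from its defining integral over (0, 1),
    evaluated by the midpoint rule."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        total += x ** (m - 1) * (1 - x) ** (n - 1)
    return total * width


def beta_from_gamma(m, n):
    """Complete Beta function expressed through Gamma functions."""
    return math.gamma(m) * math.gamma(n) / math.gamma(m + n)


print(math.gamma(0.5), math.sqrt(math.pi))        # Gamma(1/2) = sqrt(pi)
print(math.gamma(5), math.factorial(4))           # Gamma(n + 1) = n!
print(math.gamma(4.5), 3.5 * math.gamma(3.5))     # Gamma(n + 1) = n Gamma(n)
print(beta_by_quadrature(2.5, 3.5), beta_from_gamma(2.5, 3.5))
```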
C; 1. The Relation between a priori “Deviations” and Ob-
served “Statistical Frequencies”

In many presentations of this subject it is customary to
illustrate the occurrence of a “deviation” by such experiments
as those of tossing a supposedly “dynamically perfect” coin (for
which the probability of head, say, falling is then known, a
priori), and to picture, on the other hand, an “error” (that is to
say, an error of observation occurring as a departure from a
“true” value to be sought) as the observation of an expert
marksman firing at a target with a perfect rifle on a windless
day. Good as these modes of illustration are, however, it must
be noted that their meaning must be carefully examined, for
otherwise they may be over-simplified.

The matter of the coin cannot be put better than in the
following words of Levy and Roth (P:50:28): “It is clear that
with a given coin which is tossed by some mechanical process
(beginning always with, say, the head upwards) it could be
arranged that the result of each toss is always head or always
tail; or, alternatively, that the ratio of the number of heads
to the number of tails takes on a certain series of values within
a specified range. [This] illustrates the fact . . . that in any
physical process to which probability is to apply, there are three
interlocked elements: (1) a ‘population’, P, in the above case,
of head and tails; (2) a process of selection S (here a mode of
tossing); and (3) a sample, s, drawn from P by the application
of S. This process may be stated symbolically in the form

s = S(P)”.

With regard to the illustration of the marksman firing at a
target (an instance where the a priori probabilities are not known),
confusion will often be avoided if, with Herschel (H:28) and
Ellis (H:;^7), we picture the problem not only as presenting a
distribution of the shots around a “bull”, but as implying also
the inverted question which perhaps can be put more clearly
in these terms: If the shots have been fired at a wafer which
is afterwards removed, how are we to determine, from the
distribution of the shot marks, the most probable position of the
wafer?

The preceding viewpoint is essential to an understanding of
the basic problem of Mathematical Statistics. The situation
ordinarily encountered in practice does not usually concern
simple text-book cases such as a bag containing 10 balls,
indistinguishable except that 4 are black and 6 are white, where it is
at once apparent that the a priori probabilities are known,
namely, .4 and .6 as the probabilities of drawing a black, or a
white, ball respectively, at the first attempt, or in each of a
series of successive attempts when the ball is replaced after each
trial. The problem met in practice generally presents the
investigator with a very different situation, for example, a series
of observations, such as the shots around the wafer, as an
accomplished fact (the a priori probabilities, it is to be noted,
not having been known); the investigator is then asked to estimate
as closely as he can what the real or “true” position of the wafer
was. The distinction cannot be emphasized too clearly.


C; 2. The Deviations in the Number of Occurrences, and in the 
Statistical Frequency 


If s be written for the number, $np + x$, of actual occurrences,
it will be seen that the observed statistical frequency is $\dfrac{s}{n}$ (as
on p. 187; B; 2), and the deviation x in the number of occurrences
is $s - np$. It may be emphasized at this point that the deviation
to which a probability is being assigned by this analysis is the
deviation between the actual and expected numbers of
occurrences. It clearly involves, of course, in a similar manner but
with a different scale of abscissae, the discrepancy
$\dfrac{s}{n} - p = \dfrac{x}{n}$
between the statistical frequency, $\dfrac{s}{n}$, and the true a priori
probability, p.


C; 3. The Symmetrical and Unsymmetrical Point Binomial 

To give a simple example, if there are n independent trials 
(n = 10), then if the chance of success in each trial is |(/> = §, g = J), 
the binomial distribution (3) gives for the probabilities of the 
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happenings of 0, 1, 2, ...» 10 successes the terms of the expansion 

~ [1 + 10+45 + 120+210 + 262 + 210 + 120+45 + 10 + 1], 

and therefore also these terms multiplied by the number of trials, 
10, for the distribution of the expected numbers of successes. 
These distributions are of course symmetrical, since /> = g. 

If, on the other hand, still with n==10 independent trials, 
the chance of success in each trial is .1 (/> = .!, g = . 9), the proba- 
bilities of the happenings of 0, 1,2,.. ., 10 successes from (3) are 
the terms of (.9+.l)^°i and the distribution of the numbers of 
successes is again found by multiplying each term by the number 
of trials, 10. These distributions, however, are '‘skew**, i.e., 
unsymmetrical, since p 9 ^q- 

In both cases the terms when plotted form merely a series of 
points (of a symmetrical and unsymmetrical bell shape, respec- 
tively), since under the conditions of the problems here considered 
the quantities are obviously not capable of continuous variation, 
i.e., np + x in (1) and (2) can take only the integral values 
0, 1, 2, ..., 10. 

As an illustration of the meaning of the above distributions 
in a practical case, it will be seen that if the true probability of 
death at age 75, say, were known to be .1, so that for each of a 
homogeneous group of 10 persons of that age the probability 
of death would be, invariably, .1, then the theoretical distribution 
of the number of deaths which would be expected to occur in a 
series of such homogeneous groups would be that given by the 
second expression above multiplied by 10. 
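
The two point binomials just described can be tabulated directly. The following is a minimal sketch in Python; the function name `point_binomial` is purely illustrative and is not part of the text.

```python
from math import comb

def point_binomial(n, p):
    """Return [P(0 successes), ..., P(n successes)] for n independent trials."""
    q = 1.0 - p
    return [comb(n, s) * p**s * q**(n - s) for s in range(n + 1)]

symmetric = point_binomial(10, 0.5)  # terms of (1/2 + 1/2)^10
skew = point_binomial(10, 0.1)       # terms of (.9 + .1)^10

# Numerators of the symmetrical case over 2^10 = 1024.
numerators = [round(t * 1024) for t in symmetric]
print(numerators)
print([round(t, 4) for t in skew])
```

The first list reproduces the bracketed coefficients of the symmetrical case; the second shows how the terms pile up at 0 and 1 successes when p = .1.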

C;4. The Practical Applicability of the Normal and Skew- 
Normal Curves 

The Normal Curve (10) is adopted very widely in practice, 
but the Skew-Normal form derived as (i) and (ii) in B; 5, and 
shown also in Chapter VII, is seldom used. It is particularly 
important for the actuary to appreciate the practical justification 
for, and the limitations of, this symmetrical “normal” expression, 
since, in his work concerning death rates, p and q are markedly 
dissimilar at nearly all ages, the probabilities of survivorship 
and mortality at most ages lying between about .995 and .75, 
and .005 and .25, respectively. The following examples may 
therefore be considered. 

To take a case when p and q are each near 1/2, suppose that 
p = .44 (and q = .56). Then, when n is fairly large, such as 
10,000, and x = +50, say, the normal expression (10) gives .0048, 
and the skew term $e^{\frac{50(-.12)}{20000(.2464)}} = .9988$ does not affect the 
fourth decimal place. 

Even when n is small, for example 50, (10) with, say, x = +10 
gives .0020, and the “skew” term is .9525, which causes an 
alteration only of 1 in the fourth place. The excellence of the 
approximation afforded by the symmetrical form (10) will also 
be seen from the fact that in this latter case, as a simple instance, 
the true value of the original factorial form (2), to which (10) 
is an approximation, is .0021. 

It is thus clear that the normal form gives exceedingly close 
results when p and q are each near to .5, even when n is small. 
This, of course, is to be expected, because the point binomial, 
of which (10) is an approximate representation, is but very 
slightly skew when p and q are nearly equal, whether n is large 
or small. 

When p and q are not nearly equal to 1/2, furthermore, the 
results are ordinarily very close, although in this case the value 
of n, the size of the “sample”, must be watched carefully. 

If n is large (employing an example from mortality data, 
P:i74:53) we find that if at a certain age the “true” rate of 
mortality (here p, the probability that the event will happen) 
is known to be $\frac{1}{93}$ exactly = .010753 (so that q = .989247), and 
41,385 lives were “exposed to risk”, then the probability of a 
deviation of exactly +30 from the expected number of deaths, 
np, according to the normal expression (10), is .0068, and even 
in that case the effect of the skew term is only to change this 
to .0066. 
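
The figures quoted in these examples can be checked numerically. The sketch below assumes that the normal expression (10) is $\frac{1}{\sqrt{2\pi npq}}e^{-x^2/2npq}$ and that the skew term is the factor $e^{-x(q-p)/2npq}$; the second form is inferred from the worked value $e^{\frac{50(-.12)}{20000(.2464)}}$ above rather than quoted from (i) and (ii) of B; 5, and the function names are illustrative only.

```python
from math import comb, exp, pi, sqrt

def normal_term(n, p, x):
    """Normal ordinate for a deviation x from the expected number np."""
    npq = n * p * (1 - p)
    return exp(-x * x / (2 * npq)) / sqrt(2 * pi * npq)

def skew_factor(n, p, x):
    """Assumed form of the skew term: exp(-x(q - p) / (2npq))."""
    q = 1 - p
    return exp(-x * (q - p) / (2 * n * p * q))

def exact_term(n, p, s):
    """Exact factorial (binomial) probability of s occurrences."""
    return comb(n, s) * p**s * (1 - p)**(n - s)

cases = [(10_000, 0.44, 50), (50, 0.44, 10), (41_385, 1 / 93, 30)]
for n, p, x in cases:
    base = normal_term(n, p, x)
    print(n, round(base, 4), round(skew_factor(n, p, x), 4),
          round(base * skew_factor(n, p, x), 4))

# Exact value for n = 50, p = .44: np = 22, so x = +10 means s = 32.
print(round(exact_term(50, 0.44, 32), 4))
```

On these assumptions the normal values, the skew factors, and the exact value for n = 50 all agree with the figures given in the text.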

If n is small, however, care must be taken in applying the 
Normal Curve. It will be noted, from the proof given at p. 204; 
B; 5, that, in effect, n was supposed to be sufficiently large that 

m-i) could be neglected even if p is not nearly equal to q. 

Without entering into the somewhat complex mathematical 
considerations underlying the admissibility of this assumption 
under varying circumstances (for which see P:/4^:119-133), it 
will be evident that the use of the normal form (10) cannot be 
accepted with confidence when p (or q) is so small and n 
sufficiently large that np (or nq) remains finite but small (see 
p. 230; B; 15 and p. 306; C; 14). This may be realized, also, 
from a consideration of Figure 21, showing the frequency polygons 
of an unsymmetrical point binomial $(q + p)^n$ such as $(.9 + .1)^n$, 
for small values of np up to about 10, i.e., values of n up to 
about 100 (which might occur in small mortality groups). 

Figure 21. Unsymmetrical Point Binomials for Small Values of np. 

It will be seen that, although the polygon rapidly assumes a 
shape very close to the normal as np increases, nevertheless for 
certain of the smaller values of np and n the lack of symmetry 
is very marked. Under such circumstances, therefore, it will 
be advisable to examine carefully the applicability of the Normal 
Curve in any particular case. 
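
As a rough numerical companion to this remark, the following sketch compares each term of the point binomial $(.9 + .1)^n$ with the corresponding normal ordinate and reports the largest gap for several values of n. The helper names are illustrative, and the "largest gap" measure is chosen here only for convenience; it is not a criterion taken from the text.

```python
from math import comb, exp, pi, sqrt

def exact(n, p, s):
    """Exact point-binomial probability of s occurrences."""
    return comb(n, s) * p**s * (1 - p)**(n - s)

def normal_ordinate(n, p, s):
    """Normal ordinate at the deviation x = s - np."""
    npq = n * p * (1 - p)
    x = s - n * p
    return exp(-x * x / (2 * npq)) / sqrt(2 * pi * npq)

def worst_gap(n, p):
    """Largest absolute difference between exact term and normal ordinate."""
    return max(abs(exact(n, p, s) - normal_ordinate(n, p, s))
               for s in range(n + 1))

for n in (10, 20, 50, 100):
    print(n, round(worst_gap(n, 0.1), 4))
```

The gap is large when np is about 1 and shrinks steadily as np approaches 10, which is the behaviour the frequency polygons are meant to display.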

From the above examples, and from others shown in V:27: 
182-4, it will be seen that the neglect of the skew term in the 
Skew-Normal form is usually unimportant. When the skewness 
is so marked that the Normal Curve is inappropriate it will, in 
fact, be preferable to use the Poisson exponential (55), which is 
developed later in Chapter VII and illustrated at p. 306; C; 14, 
rather than the Skew-Normal correction. 


C ; 5. The Finite Integration of the Normal Law of Deviations 

Taking the mortality example of p. 266; C; 4 again, and hence 
supposing that n represents a number of lives, such as 41,385, 
exposed to risk at a certain age, and that the true rate of mortality 
is known a priori, or is given, as $\frac{1}{93}$ exactly, it follows that the 
probability of a deviation of exactly +30 is, by (10), 
approximately .0068. With the possibility of errors of different 
magnitudes, this probability of an error of a particular magnitude 
occurring is, of course, small, and, furthermore, is obviously not 
a matter of any special significance. The important question for 
investigation, in fact, is clearly not this small probability of some 
such specific deviation occurring, but is rather the probability 
that any deviation which may occur will be within certain 
limits, or that it will not exceed a certain amount. In this case 
the expected deaths are 445; let us therefore examine the 
probability that the actual deaths, instead of being the 445 
expected, will lie between 415 and 475, i.e., that the deviation 
will not exceed 30 in either direction. Being 

$$\sum_{x=-k}^{+k} \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{x^2}{2npq}},$$

where k = 30, the summation is ordinarily effected in practice, 
as explained at p. 207; B; 6, by taking either 

$$\frac{2}{\sqrt{\pi}} \int_0^{k/c} e^{-t^2}\,dt + y_k, \quad\text{or}\quad \frac{2}{\sqrt{\pi}} \int_0^{(k+\frac{1}{2})/c} e^{-t^2}\,dt,$$

where $c = \sqrt{2npq}$, from tabulated values of the “probability 
integral” or “error function” $\frac{2}{\sqrt{\pi}} \int_0^{x} e^{-t^2}\,dt$ (see p. 160; A; 5). 

In the above illustrative case, $\frac{k+\frac{1}{2}}{c}$, where $c = \sqrt{2npq}$, is 1.0279, 
and $\frac{k}{c} = 1.0111$. The alternative formulae therefore give from 
these tables 

$$\frac{2}{\sqrt{\pi}} \int_0^{1.0111} e^{-t^2}\,dt + y_{30} = .854, \quad\text{or}\quad \frac{2}{\sqrt{\pi}} \int_0^{1.0279} e^{-t^2}\,dt = .854.$$

V 0 V TT J 0 

That is to say, it may be anticipated (neglecting the third 
decimal place) that in about 850 experiences out of 1,000 the 
actual deaths amongst 41,385 exposed to risk, when the true 
rate of mortality is $\frac{1}{93}$, will lie between 445 + 30 and 445 − 30; 
that is, in about 85 experiences out of 100 the deviation might 
range up to, but would not exceed, 30 in either direction. A 
deviation of 30 or less between the expected and the actual deaths 
would therefore not occasion surprise. Or, more precisely, the 
interpretation to be placed on this result would be that a 
deviation of 30 or less would not, as a practical matter, be 
significant, in the sense that it would not raise the presumption 
of the existence of a disturbing influence beyond that of merely 
chance fluctuations. 

It may be useful to note also here that the complementary 
interpretation follows (as may similarly be demonstrated from 
the summations $\sum_{x=k}^{\infty} y_x + \sum_{x=-\infty}^{-k} y_x$), namely, that a deviation 
of 30 or more would be anticipated in only about 15 experiences 
in 100. 
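
The finite integration for this mortality example can be repeated with a library error function in place of printed tables. This is a sketch only: it uses Python's `math.erf`, takes (10) as the normal ordinate $y_x = \frac{1}{\sqrt{2\pi npq}}e^{-x^2/2npq}$, and adds a direct summation of the ordinates as a cross-check that is not part of the text.

```python
from math import erf, exp, pi, sqrt

n, p = 41_385, 1 / 93
q = 1 - p
npq = n * p * q
c = sqrt(2 * npq)
k = 30

def y(x):
    """Normal ordinate for a deviation of exactly x."""
    return exp(-x * x / (2 * npq)) / sqrt(2 * pi * npq)

first = erf(k / c) + y(k)       # integral to k/c plus the end ordinate
second = erf((k + 0.5) / c)     # integral to (k + 1/2)/c
direct = sum(y(x) for x in range(-k, k + 1))  # summation of ordinates

print(round(k / c, 4), round((k + 0.5) / c, 4))
print(round(first, 3), round(second, 3), round(direct, 3))
print(round(1 - second, 3))     # chance of a deviation beyond the limits
```

Both forms give .854, and the complement of about .146 corresponds to the "about 15 experiences in 100".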


C; 6. Numerical Illustrations of the Mean Error, Mean Square 
Deviation, Standard Deviation, and Probable Error. 

The application of these formulae may be illustrated as 
follows. 

Using again, as a particular instance, the mortality case 
of C; 4 and C; 5, with n = 41,385 exposed to risk and $p = \frac{1}{93}$ 
exactly = .010753 (so that q = .989247, and the expected deaths 
np = 445), we should find: 


(a) Mean Error, Irrespective of Sign, by (20), 

$$= .797885\sqrt{npq} = .797885\sqrt{41385 \cdot \tfrac{1}{93} \cdot \tfrac{92}{93}} = 16.74 \text{ accurately},$$

or approximately = 16.78 nearly. 


That is to say, the average deviation (irrespective of sign) from 
the 445 expected deaths to be anticipated in an experience of 
41,385 lives exposed to risk is 16.74 accurately, or approximately 
16.78. 


(b) The Mean Square Deviation, or Variance, being, by (7) 
and (14), = npq, is similarly 440.22. 

(c) The Standard Deviation, $\sigma$, being $\sqrt{npq}$ by (8), 
accordingly = 20.98. 

It then follows, since the probabilities of deviations lying 
within the ranges $\pm\sigma$, $\pm 2\sigma$, $\pm 3\sigma$, ... have been shown to be 
.6827, .9545, .9973, ..., that in this case these latter probabilities 
are those of deviations within the ranges (to the nearest integer) 
±21, ±42, ±63, ... from the mean (expected) number of 
deaths, 445. Or, to put the matter more specifically, the 
probability is, for instance, practically .9973 that in an 
experience of 41,385 lives, with a true mortality rate of $\frac{1}{93}$, the 
number of deaths will actually lie between 445 − 63 and 445 + 63, 
i.e., between 382 and 508. 

If, as in a body of lives insured against the contingency of 
death, we view an excess number of deaths as unfavourable, the 
probability of actually experiencing 445+63 =508 deaths or more 
will be about .00135; while, on the other side, the probability of 
unfavourable experience in a group to whom annuities had been 
sold, to the extent of the deaths numbering only 382 or less, 
would be .00135 approximately. That is to say, insurance or 
annuity experiences unfavourable to this extent would be 
anticipated in only 13.5 (say 13 or 14) times in 10,000 trials, and 
thus would be very unlikely. 

Experiences unfavourable to the extent of either $+2\sigma$ (= 42) 
or more, or $-2\sigma$ (= −42) or less, would similarly have 
probabilities of .0228, so that actual deaths of 445 + 42 = 487 or 
more in an insurance experience, or 445 − 42 = 403 or less in an 
annuity group, would be anticipated only 228 times in 10,000 
trials, or only in slightly more than 2¼% of the experiences. 

(d) For the Probable Error, $\lambda$, which, by (21), $= .674489\sqrt{npq}$ 
= 14.2, the numerical applications are made in exactly the same 
manner. Basing an illustration on $\pm\lambda$, for instance (since the 
use of $\lambda$ itself rather than its multiples was its original 
concept), we see that the probability is .5, i.e., it is an even 
chance, that the actual deaths will lie (using integers) between 
445 − 14 and 445 + 14, i.e., between 431 and 459. It is 
consequently also an even chance that they will exceed 459 or be 
less than 431. Considering, as before, the probabilities of 
positive and negative discrepancies separately, the probability 
is .25 of unfavourable deaths in an insurance experience to the 
extent of 459 actual deaths or more, while again in 25% of the 
experiences the unfavourable number of deaths in an annuitants’ 
group would be 431 or less. 
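
The four measures (a) to (d) and the associated probabilities can be reproduced together. The sketch below assumes only the constants quoted in the text (.797885 for the mean error, which is $\sqrt{2/\pi}$, and .674489 for the probable error) and uses `math.erf` for the normal probabilities; the variable names are illustrative.

```python
from math import erf, pi, sqrt

n, p = 41_385, 1 / 93
q = 1 - p

variance = n * p * q                # (b) mean square deviation
sigma = sqrt(variance)              # (c) standard deviation
mean_error = sqrt(2 / pi) * sigma   # (a) .797885 * sigma
probable_error = 0.674489 * sigma   # (d) probable error

def prob_within(d):
    """Probability that the deviation lies within +/- d (normal law)."""
    return erf(d / (sigma * sqrt(2)))

print(round(mean_error, 2), round(variance, 2),
      round(sigma, 2), round(probable_error, 1))
for m in (1, 2, 3):
    inside = prob_within(m * sigma)
    # One-sided tail: unfavourable experience in one direction only.
    print(m, round(inside, 4), round((1 - inside) / 2, 5))
print(round(prob_within(probable_error), 3))
```

The one-sided tails for $2\sigma$ and $3\sigma$ come out at about .0228 and .00135, and the probability within the probable error is .5, as stated.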


The solution of the inverse problem is also important, namely, 
the determination of the number of trials necessary to secure a given 
probability for a stated deviation between the actual and expected 
values. Suppose, for example, that we wish to find approximately 
how many trials, n, must be taken to secure a probability 
of .999 that $\frac{s}{n}$ will differ from p by $\epsilon$ or less. 

From (11), the probability of a deviation in np of d or less in 
absolute magnitude is 

$$\frac{2}{c\sqrt{\pi}} \int_0^{d} e^{-x^2/c^2}\,dx = \frac{2}{\sqrt{\pi}} \int_0^{d/c} e^{-t^2}\,dt = \operatorname{Erf}\left(\frac{d}{c}\right),$$

and since the stated deviation of $\epsilon$ or less between $\frac{s}{n}$ and p means 
a deviation of $\epsilon n$ or less between s and np, we have $d = \epsilon n$ and 
consequently must find (see p. 162; A; 5) the value of n for which 
$\operatorname{Erf}\left(\frac{\epsilon n}{c}\right) = .999$. But .999 = Erf (2.327); hence we simply 
require the value of n for which $\frac{\epsilon n}{c} = 2.327$, where $c = \sqrt{2npq}$. 

To take a numerical case, let $p = \frac{1}{3}$, $q = \frac{2}{3}$ and $\epsilon = .01$; then 
$\frac{\epsilon\sqrt{n}}{\sqrt{2pq}} = 2.327$, whence n = 24,066, i.e., it would be necessary to 
have (approximately) 24,066 trials in order to be satisfied, to the 
extent of a probability of .999, that the observed statistical 
frequency, $\frac{s}{n}$, would be within .01 of the true value, p. 

This example is, of course, an illustration of the meaning of 
Bernoulli’s Theorem (cf. p. 187; B; 2). 
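
The inverse problem reduces to solving $\frac{\epsilon n}{\sqrt{2npq}} = t$ for n, which gives $n = 2pq\,(t/\epsilon)^2$. The sketch below applies this with t = 2.327 and $\epsilon = .01$. The values p = 1/3 and q = 2/3 are an assumption made here because they reproduce the stated result of 24,066 trials; the function name is illustrative.

```python
from math import erf, sqrt

def trials_needed(p, eps, t):
    """Solve eps * n / sqrt(2 n p q) = t for n, i.e. n = 2pq (t / eps)^2."""
    q = 1 - p
    return 2 * p * q * (t / eps) ** 2

p, eps, t = 1 / 3, 0.01, 2.327   # p is assumed; t satisfies Erf(t) = .999
n = trials_needed(p, eps, t)
print(round(n))

# Check: with this n, the probability of |s/n - p| <= eps is about .999.
c = sqrt(2 * n * p * (1 - p))
print(round(erf(eps * n / c), 4))
```

Because n varies as $1/\epsilon^2$, halving the permitted deviation quadruples the number of trials required.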


C; 7. The Mean Square Error, $\sigma^2$, of the Observed Values of 
Certain Actuarial Functions 

The general formulae (23), (27), and (28) for the mean square 
error in a multiple, a linear compound, and a function, 
respectively, are frequently required in actuarial problems, 
particularly in connection with “graduation” (see Chapter VIII) 
and the “theory of risk”. It is therefore important that the 
student should have available the expressions for the most usual 
of these cases. They are consequently assembled here. 

It must be remembered that the analysis by which the general 
formulae have been deduced is usually admissible except when 
q (or p) is so small and yet n large enough that nq (or np) is less 
than about 10 (see p. 265; C; 4). The practical application of 
the formulae for the particular cases given below is consequently 
subject to the same limitation. In any doubtful cases the 
smallness of q (or p) and nq (or np) should therefore be examined 
critically in relation to the comments of p. 306; C; 14. 

The employment of a descriptive notation will be especially 
convenient here in order to identify the instance under discussion. 
We shall therefore write $\sigma^2\{A\}$ to denote the mean square error 
of the observed quantity A, etc., and likewise $\sigma\{A\}$, $\eta\{A\}$, and 
$\lambda\{A\}$, etc. Thus $\sigma^2\{q'_x\}$ will represent the mean square error 
of the observed value of the rate of mortality; $\lambda\{q'_x\}$ will 
denote its “probable error”; and so forth. 

(i) The Survivors Observed at the End of a Year of Age 

If n persons comprising a homogeneous group, all of the same 
age x, are observed over the year of age x to x + 1, and $np'_x$ of 
them actually survive and so attain age x + 1, it follows from (7) 
that the mean square error in this number of “successes” is 
$np_xq_x$, where $p_x$ and $q_x$ are the “true” (not the observed) 
probabilities of survival and death in the year of age x to x + 1. 
The probable error, similarly, is $.6745\sqrt{np_xq_x}$ by (21). In the 
practical application of these expressions the “exposed to risk”, 
$E'_x$, is of course often shown instead of n. 

This formula for the probable error has been stated and 
illustrated in H:54:187-8. If, for example (employing the data 
which are, in fact, there used), n is taken as 7,943 and the 
observed survivors at the end of the year as 7,847, with $p_x = .9879$ 
and $q_x = .0121$, we find this probable error to be 6.57, which 
means that it is an even chance that the survivors at the end 
of the year (out of the 7,943 entering upon that year) will lie 
between 7,847 + 6.57 and 7,847 − 6.57, i.e., between 7,853.57 and 
7,840.43, or (using integers) between 7,854 and 7,840. 
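
The arithmetic of this illustration is short enough to restate as a sketch; the variable names are illustrative and the constant .6745 is the one quoted from (21).

```python
from math import sqrt

n, p_x, q_x = 7_943, 0.9879, 0.0121
survivors = 7_847

mean_square_error = n * p_x * q_x               # n p q, by (7)
probable_error = 0.6745 * sqrt(mean_square_error)

print(round(probable_error, 2))
print(round(survivors + probable_error, 2),
      round(survivors - probable_error, 2))
```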

It is to be noted carefully that the formulae should be applied 
and interpreted, as here shown, for the purpose of examining 
the mean square error, probable error, etc., in isolated observed 
values over a single year of age when the true px and qx are 
known or can be assumed. 

Since the preceding statistical example is stated in H:54:188 
on the basis of certain numerical values from a hypothetical 
“life-table”, it may therefore be advisable to warn the student 
that the formulae used therein cannot be applied directly to 
find the mean square error, etc., of the hypothetical life-table 
function $l_x$ (the number of survivors at age x from the 
$l_{x-1}$, $l_{x-2}$, ... at earlier ages) when that $l_x$ is constructed from 
a series of inter-related observations at earlier ages. For if an 
observed $l'_x$ is thus built up from an arbitrary radix, say $l_a$ at 
age a, by successive multiplication by the observed probabilities 
of survival $p'_a$, $p'_{a+1}$, ..., we have $l'_x = l_a p'_a p'_{a+1} \cdots p'_{x-1}$. The mean 
square error in this function would then be found approximately 
by means of (28) as 

$$\sigma^2\{l'_x\} = \sum_{t=a}^{x-1} \left[\left(\frac{\partial l_x}{\partial p_t}\right)^2 \sigma^2\{p'_t\}\right],$$

which, by (v) on p. 276, $= l_x^2 \sum_{t=a}^{x-1} \left(\frac{q_t}{p_t E'_t}\right)$, where again the undashed 
symbols refer to the “true” values. 
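
To make the distinction concrete, the sketch below evaluates the approximate mean square error of a life-table $l'_x$ built up from a radix, using the form $l_x^2 \sum_t \frac{q_t}{p_t E'_t}$ obtained from (28) with $\sigma^2\{p'_t\} = \frac{p_t q_t}{E'_t}$. The survival probabilities and exposures are hypothetical numbers chosen purely for illustration, and the function name is not from the text.

```python
from math import isclose, prod

def mse_survivors(l_x, p, E):
    """Approximate mean square error of l'_x: l_x^2 * sum(q_t / (p_t E_t))."""
    return l_x**2 * sum((1 - pt) / (pt * Et) for pt, Et in zip(p, E))

# Hypothetical "true" survival probabilities and exposures at ages a, a+1, a+2.
radix = 100_000
p = [0.990, 0.989, 0.988]
E = [5_000, 4_800, 4_600]
l_x = radix * prod(p)

print(round(mse_survivors(l_x, p, E), 1))

# With a single year of age and the radix equal to the exposure n,
# the expression collapses to n p q, the result of section (i).
n, p1 = 7_943, 0.9879
print(round(mse_survivors(n * p1, [p1], [n]), 3))
```

The second call shows that the single-year formula of (i) is the special case in which only one observed probability enters the product.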


(ii) The Deaths Observed during a Year of Age, $\theta'_x$ 

By the same reasoning as that in the first paragraph of (i), 
it follows at once that if n, or $E'_x$, persons aged x in a 
homogeneous group are observed from age x to x + 1, and if 
$\theta'_x$ ($= nq'_x = E'_xq'_x$) of them die, then from (7) the mean square 
error in these deaths is again $np_xq_x$ ($= E'_xp_xq_x$), where $p_x$ and 
$q_x$ are the “true” values; and the probable error, by (21), is 
$.6745\sqrt{np_xq_x}$ ($= .6745\sqrt{E'_xp_xq_x}$). As in (i), these formulae 
refer to an isolated observation of a single year of age, and do 
not apply to the hypothetical life-table deaths, $d_x$, where $d_x$ is 
derived from a series of inter-related observations at earlier 
ages. 


(iii) The Observed Probability of Death, $q'_x = \frac{\theta'_x}{E'_x}$ 

Using a similar argument for this case also, it will be seen 
immediately from (14) and (22), or (23), that $\sigma^2\{q'_x\} = \frac{p_xq_x}{E'_x}$. 

(iv) The Observed Central Death Rate, $m'_x = \frac{\theta'_x}{E'_{x+\frac{1}{2}}}$ 

Since $m'_x$ is defined as a death rate operating upon an observed 
exposed to risk $E'_{x+\frac{1}{2}}$ in the middle of the year of age, and 
producing during that year the observed deaths $\theta'_x$, it follows, as 
in (m), that f — where m* is the “true “value. 

This formula has been used in H:,4>^’164. 

An alternative approximate formula, a* { , has been 

derived in F:61:100, and it is there also suggested that in 
practice this may be modified to give $\frac{m'_x}{\sqrt{\theta'_x}}$ as a 
rough approximation for the standard deviation. For, since 
-Ex+i / Ex^l - 1*^ and wi* = =3x^1 - .we see that 

the expression in the preceding paragraph may be written 

/ / 

/ . Wj. . nix . nix 

Consequently $\sigma\{m'_x\} = \frac{m'_x}{\sqrt{\theta'_x}}$ simply, 
and conveniently, the rate divided by the square root of the 
deaths. It will be realized, however, from the nature of the 
approximations involved, that these formulae should not be 
relied upon for anything more than rough indications. 


(v) The Observed Probability of Survival, $p'_x$ 

By exactly the same principle as that used in (iii), 
$\sigma^2\{p'_x\} = \frac{p_xq_x}{E'_x}$. 

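(vi) The Observed $\operatorname{colog}_e p'_x$ 

This can be obtained at once, approximately, from (28). For 
in that formula $F = f(F_1) = f(p'_x)$; 

$$\left(\frac{\partial F}{\partial F_1}\right)^2 = \left[\frac{d\,(\operatorname{colog} p_x)}{dp_x}\right]^2 = \left(\frac{1}{p_x}\right)^2;$$

and $\sigma^2\{F_1\} = \sigma^2\{p'_x\} = \frac{p_xq_x}{E'_x}$ by (v) here. 

Hence $\sigma^2\{\operatorname{colog} p'_x\} = \left(\frac{1}{p_x}\right)^2 \left(\frac{p_xq_x}{E'_x}\right) = \frac{q_x}{p_xE'_x}$ (cf. H 43:318 and 
P:lS3:281-2). 

A method of deducing this expression without the use of (28) 
is given in H:108:25d. The same formula may also be obtained 
from first principles as shown in H:70:393. 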

(vii) The Ratio of Actual to Expected Deaths 

Since the “expected deaths” are entirely independent of the 
rate of mortality, $q'_x$, actually experienced, this is a simple 
case of (23) where the multiplier $k = \frac{1}{\text{Expected Deaths}} = \frac{1}{E'_xq_x}$ 
and $\sigma^2\{\text{Actual Deaths}\} = E'_xp_xq_x$ by (7). Hence 

$$\sigma^2\left\{\frac{\text{Actual Deaths}}{\text{Expected Deaths}}\right\} = \left(\frac{1}{E'_xq_x}\right)^2 E'_xp_xq_x = \frac{p_x}{E'_xq_x}.$$


The student should note that in some practical applications 
this formula has been modified. It is somewhat common 
practice to assume that $p_x$ may be taken as unity without sensible 
loss of accuracy. This, however, should not be done without 
careful examination; see p. 293; C; 10 here, and (for example) 
J.I.A., LXVIII, 62. 

If it is justifiable, $\frac{p_x}{E'_xq_x}$ becomes simply 
$\frac{1}{E'_xq_x} = \frac{1}{\text{Expected Deaths}}$, as used in P:^. 

If, in addition, it should be possible to assume that $q_x$ and 
$q'_x$ would not differ markedly, then $\frac{1}{E'_xq_x}$ becomes 
$\frac{1}{E'_xq'_x} = \frac{1}{\text{Actual Deaths}}$ (cf. J.I.A., LXVIII, 62). 

In dealing with a special category of lives (such as those 
with some particular medical impairment, or in a hazardous 
occupation) which may be subject to mortalities differing 
markedly from any standard table, it will be evident that the 
observed probabilities of death, $q'_x$, of the special category may 
themselves provide a much closer approximation to the “true” 
probabilities of the special category than will be afforded by 
the values, say $q^s_x$, of the standard table with which the observed 
probabilities might be compared. Under these circumstances, 
therefore, it will be better to take $E'_xp'_xq'_x$ rather than 
$E'_xp^s_xq^s_x$. Then 

$$\sigma\left\{\frac{\text{Actual Deaths}}{\text{Expected Deaths}}\right\} = \frac{\sqrt{E'_xp'_xq'_x}}{E'_xq^s_x},$$

or (taking $p'_x$ as unity) 

$$\frac{\sqrt{\theta'_x}}{E'_xq^s_x} = \frac{M.R.}{\sqrt{\theta'_x}},$$

where the “mortality ratio”, M.R., is the ratio of actual to 
expected deaths, i.e., $\frac{\theta'_x}{E'_xq^s_x}$. This method is used, for example, 
in P;^:12. 


(viii) The Rate of Mortality by Amounts 

The preceding formulae for $\sigma^2\{q'_x\}$, etc., are based on the 
customary definition of the probability of death in the year of 
age x to x + 1 (the “rate of mortality”, as it is usually called), 
being the probability that a life aged x will die between ages 
x and x + 1. In the practical estimation of the monetary losses 
to be anticipated in plans of insurance, however, the measure 
of “success” or “failure” may obviously be indicated more 
closely in some cases by the amounts of money involved than 
by the mere enumeration of the lives without regard to the 
various monetary burdens which they represent. Under such 
circumstances it becomes important to consider the mean square 
error of the so-called “rate of mortality by amounts”, i.e., the 
probability that a random monetary unit (say $1.00) exposed to 
risk will become a claim by death in the year of age x to x + 1. 
The necessary formulae for this case have been deduced by the 
following reasoning in P:19. [In the notation which is adopted 
here, a takes the place of Cody’s s in that paper, and the 
experienced values are, as throughout this book, distinguished by 
dashes.] 


Since at age x the various lives exposed, which hitherto have 
been assumed to be a homogeneous group of individuals, will 
now be associated with varying amounts of the monetary unit, 
it will be seen that the total “exposed to risk” £' must be defined 
as 2 (®jE^), where is the exposure in the “amount class” 
a when the amounts at risk in the year of age x to x + l are 
classified by the varying amounts of the monetary unit, and 
where the S covers all the values of a. Similarly the actual 
losses which emerge upon death, totalling in the year of age, 
will be 2(®^^). From these definitions it follows that in the 
amount class a the observed probability of loss in the year of age x 

to X+l will be — y say, and that for all amount classes in 


that year of age the observed probability will be 




e: 


=?* 


say. Correspondingly, if the “true” probability in the amount 
class a be written the expected losses in that class will be 
= say; and if the “true” probability for all the amount 
classes combined be denoted by then the expected losses in 
all amount classes will be say. Furthermore, we must 

evidently have the relation = consistently with 

for the observed data. 

In this model there are a units in the amount class a for 
every unit which would be tabulated if the data were compiled 
on the basis of lives only. We can therefore write 
where denotes the actual deaths (as distinct from the mone- 
tary losses) in the amount class a, and similarly 
where denotes the lives exposed to risk. Consequently 

by (23). Moreover, by section 
{ii) of this Appendix C; 7, we shall have ^px 

so that =a®(®/« ®5®). Proceeding now to the con- 

sideration of for the groups, it follows at once from these 
relations and (27) that 2 (®^;^)1 = 2[a2 

=S[a and from this result we see that or 

for all the amount classes, which by (23) 

KE^x) . {Exr 

becomes 2[a «£' ^px ^qxl 
(Ex) 

In order to get this into a form related to the usual 


= as stated in (Hi) here for the probability of death when 
Ex 

only lives (not amounts) were under consideration, we note that, 
since ®5*==2ar+(5j'“ff«) and ^px=px^(ql—qx)t the expression 
just given can be put as 

(Ex) 

~ (£'}^ “£-x(sx Qx)(px g*)]^ 

= _J_ s[o 1 + r 


-*> (x) 


where-J!!. 


iKy 


1+ 




2[o “£*] 
and namely, the total number of lives exposed to risk. 

4 ) 0 

This — ^ , in accordance with section (m) here, is the mean 
lx 


square error of the rate of mortality observed in a group of l^ 
lives exposed to risk, in which the ‘‘true” rate of mortality, qxy 


0 

is— , as previously defined; and jR^ represents the ratio in which 

that mean square error, based on lives, is increased when the 
investigation follows a classification by amounts. 

A numerical illustration may be found in F:19:72. 


(ix) The Ratio of Actual to Expected Losses by Amounts 

In section (vii) an approximate formula for the standard 
deviation of the ratio of the actual to expected deaths by lives 
was found to be $\frac{M.R.}{\sqrt{\theta'_x}}$, where the “mortality ratio”, 
$M.R. = \frac{\theta'_x}{E'_xq^s_x}$, and $q^s_x$ is the rate of mortality according to the 
standard table used in computing the expected deaths. When 
the ratio of actual to expected losses is to be taken by amounts 
(not by lives), the development in section {viii) evidently must 

be used instead, with the formula o’Ms*} the 

basis, where px and qx are the “true” probabilities (based on 
amounts) for all the amount classes combined. In dealing with 
a special category of lives, moreover, as pointed out in section 
(m), it will be preferable to use the observed probabilities, rather 
than those of the standard table, as an estimate of the “true” 
probabilities. 

Now, corresponding to crMs*} we have (t^[Bx]^ 

= (^*)V{g*} = {E^RxY which here is to be 

taken approximately as 


Hence also 
/>', as in {vii), since />*=!, this is ^ 

\lxH^xY \lJ{E^q 

-■(! 


) . The corresponding approximate formula 


for the standard deviation, a, is therefore Rx A/ — 

^ lx W^'/ 

„ M,R, stated in P:i^:72. 


or 


VP 


(x) The Observed Expectation of Life, $e'_x$ 

A convenient approximate formula for may be 

obtained easily by writing 

and applying (28) thereto since the values of q' are independent. 
Dropping the primes for the preliminary formula which follows, 
we see that 


dOx __ d 

d 




^ (-1) 

Px+n 


{px + 2px + - . •+npx)-\-{{pxPx-^l . • . Px+n) 

'\~iPxpx+l . . . Px-\-nPxArn+\)-\- ’ * *1 j 

|^(/^x+2^x + .. . + n/>x) + (l “-gx+n){(/>x^x4-l* • - Px+n-l) 
~^{Pxpx^-\ . . . />x4-n~lPx+n+l)+. . .} 

[■ 


+l/'*+n+2/'*+- . .J 

= ^ J Cj+ri, since the g’s are independent 
^»+n\ lx / 


so that - = — 1 when <=«, and is 0 otherwise, 

^ffx+n 
Using this result in (28), therefore, and remembering that, 
by (iii) here, $\sigma^2\{q'_x\} = \frac{p_xq_x}{E'_x}$, we find immediately ($\omega$ being 
the limiting age) 




S 


n = 0 


( /x+n / P x-^nQ,x-\-i\ 

Px-\-nlx ) \ -^x+n / 


\lx/ «=oL Px-\-nE/x-\-n - 


which corresponds to Steffensen's formula for the life annuity 
in {xi) below. 

A comparable formula for with a discussion of the 

effect of using groups of ages, is to be found in 'P\158, 


(xi) The Immediate Life Annuity, $a'_x$ 

The first examination of the mean square error of $a'_x$ was 
undertaken by G. F. Hardy in P:5f:99, where he deduced an 
approximate formula only, and gave numerical illustrations. 
It may be useful to note his conclusion that “the standard 
deviation in the values of the 3% annuities in an aggregate 
experience such as the $O^{M(5)}$ [the British Offices’ Life Tables, 
1863-93, Males, excluding the first 5 years of insurance] is about 
one-fiftieth of a year’s purchase from about 30 to 65 years of 
age”. He also pointed out that in “an experience including 
about 1000 deaths distributed approximately as in the data, 
the deduced annuity values between ages 30 and 60 would on 
the average be uncertain to about ±.20, or from 1% to l\% of 
their values”. 

A closer approximation was next given by J. F. Steffensen 
(P:fSS:281), who used the method of proof based on (28) as in 
(x) above, and thus reached the following formula corresponding 
to that already given for the expectation of life (which is, 
of course, the immediate life annuity at zero rate of interest): 



This expression shows clearly that if the exposed to risk, $E'_x$, 
be multiplied at each age by a constant factor, k, then $\sigma\{a'_x\}$ 
will be divided by $\sqrt{k}$. For example, the total exposed in the 
$O^{M(5)}$ experience was 7,659,454, and in the $H^M$ [Institute of 
Actuaries’ Life Tables, Healthy Males, 1863] was 1,199,093; if, 
therefore, the same proportionate distribution could be assumed 
over the various ages, $\sigma\{a'_x\}$ for the $O^{M(5)}$ experience should be 
about .4 of that for the $H^M$, an indication with which the figures 
at decennial ages “agree very fairly except at the highest ages” 
(P:i55:282). 
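
The factor of "about .4" follows from the inverse square-root rule applied to the two total exposures. A minimal sketch of the arithmetic (variable names illustrative):

```python
from math import sqrt

exposed_larger = 7_659_454   # total exposed in the larger experience
exposed_smaller = 1_199_093  # total exposed in the smaller experience

# Multiplying every exposure by k divides the standard deviation by sqrt(k).
k = exposed_larger / exposed_smaller
ratio = 1 / sqrt(k)
print(round(k, 2), round(ratio, 4))
```

This assumes, as the text does, the same proportionate distribution of the exposed to risk over the various ages in both experiences.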

Finally, T. Tinner (P:Jf4^:305) deduced from first principles 
an exact expression in the form 

cHa',] = +2a.+,)|(l + 

and demonstrated its relation to Hardy’s and Steffensen’s 
approximations. Tinner’s investigations showed that the ratio 
of the exact standard deviation to the value thereof derived 
from the approximate expressions is generally much less than a 
maximum of about 1.04, so that, “as it would never be necessary 
to compute the standard deviation with minute accuracy, it is 
clear that the approximate expression gives a result close enough 
for all practical purposes”. 

(xii) “Linear Compounding” in the Theory of Graduation 

Formula (27), which shows the relation of the mean square 
error of any linear compound of a number of independent 
quantities to the mean square errors in the quantities themselves, 
is of very special importance to actuaries, since it is used for 
certain basic criteria and comparative tests in the theory of 
graduation (see Chapter VIII). Its customary method of 
application for those purposes, and the manner in which it 
affords a measure by which the graduating power of different 
“linear compound” graduation formulae may be assessed, are 
explained in the author’s paper and in P:1^7:lll, towhich 
actuarial students must be referred for details. It will be 
sufficient here to give (from that paper) the following statement 
of the principles involved. 
“Graduation” by “linear compounding” concerns the replacement
of an observed value $u'_r$ (of a “true” value $u_r$) by a linear
compound, $v_r$, of $u'_r$ and terms $u'_{r+1}, u'_{r+2}, \ldots$,
$u'_{r-1}, u'_{r-2}, \ldots$ on either side of it, on the assumption
that differences of u beyond a certain order j may be neglected.
Any such graduation formula, therefore, will be of the form

$$v_r = l_r u'_r + (l_{r+1}u'_{r+1} + l_{r-1}u'_{r-1}) + (l_{r+2}u'_{r+2} + l_{r-2}u'_{r-2}) + \ldots + (l_{r+n}u'_{r+n} + l_{r-n}u'_{r-n}) \qquad \text{(a)}$$

for a range of $2n+1$ terms; and where the l's are symmetrical,
so that $l_{r+t} = l_{r-t} = b_t$ say, this becomes

$$v_r = b_0u'_r + b_1u'_{r\pm1} + b_2u'_{r\pm2} + \ldots + b_nu'_{r\pm n} \qquad \text{(b)}$$

where $u'_{r\pm t}$ is written for $u'_{r+t} + u'_{r-t}$.

Now if the “error” in the observed $u'_r$ be $e_r$, so that
$u'_r = u_r + e_r$, it may generally be assumed, in dealing with
series of observed data such as rates of mortality, etc., that the
e's are independent and that the mean square error of each is
(say) $\epsilon^2$. It follows then, by (27), that the mean square
error of $v_r$ in (a) is $(\Sigma l^2)\epsilon^2$. The process of
“linear compounding” expressed in the general form (a) therefore
replaces the original $u'_r$ with its mean square error $\epsilon^2$
by the “graduated” $v_r$ with mean square error $(\Sigma l^2)\epsilon^2$;
that is to say, the graduation will have reduced the mean square
error of $u'_r$ to $\frac{(\Sigma l^2)\epsilon^2}{\epsilon^2}$, or
$\Sigma l^2$, of its original amount. If a particular linear compound
formula (i.e., a formula of given range and with certain values of
the l's) has been determined in some manner, this “reduction of
error”, $\Sigma l^2$, effected in the observed $u'_r$ may consequently
be taken as a measure of the “accuracy” of the graduation.
Furthermore, clearly we can set out, by making $\Sigma l^2$ a minimum,
to find the l's (for a given range) which will give the greatest
possible reduction of mean square error; the resulting linear
compound formula will then be the “best” formula (for that range)
on the particular assumptions made.
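
As a small numerical illustration of the “reduction of error” $\Sigma l^2$,
the following sketch evaluates it for two 5-term formulae. Both sets of
coefficients are assumptions introduced here for illustration (a simple
5-term average, and the familiar 5-term least-squares cubic weights); neither
is quoted from the text, and the function name is hypothetical.

```python
from fractions import Fraction


def error_reduction(coeffs):
    """Sum of squared coefficients: the factor by which the mean square
    error of an observed value is multiplied by the linear compound."""
    return sum(c * c for c in coeffs)


# Illustrative coefficients (l_{r-2}, ..., l_{r+2}); assumptions, not from the text.
simple_average = [Fraction(1, 5)] * 5
cubic_least_squares = [Fraction(c, 35) for c in (-3, 12, 17, 12, -3)]

for name, coeffs in (("simple 5-term average", simple_average),
                     ("5-term least-squares cubic", cubic_least_squares)):
    factor = error_reduction(coeffs)
    print(f"{name}: sum of l^2 = {factor} = {float(factor):.4f}")
```

The smaller the factor, the greater the reduction of mean square error; the
comparison says nothing about how well either formula reproduces a curved
underlying series, which is the purpose of the condition on differences.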

The same principle can be applied to investigate the effect
of any linear compound graduation upon the mean square error
in the differences of $v$ instead of in $v$ itself. Since in practice
differences beyond a certain order j may usually be assumed (as
stated above) to be negligible, several cases arise according to
the value assigned to j, and depending also upon the order of
difference of $v$ of which the mean square error is considered.
The reductions thus effected in the mean square error of the
differences may be taken as criteria of “smoothness”, in contrast
to the “accuracy” of v itself; and again, by determining the l's
to give the greatest possible reductions, the formulae may be
evolved which will be the “best” expressions under the
conditions assumed.

The formulae which secure the greatest possible reduction in
the mean square error of v itself are, in fact, identical with those
which result from “fitting” a polynomial
$v_x = A + Bx + Cx^2 + \ldots + Jx^j$ by the Method of Least Squares
(see P:7^^:100, and p. 132 here). Their first complete treatment
was given in 1871 by Erastus L. De Forest, after preliminary
consideration by Schiaparelli (H:5^).

For actuarial purposes, however, when usually j=3, the
important cases are those giving the greatest possible reduction
in the mean square error of $\Delta^4 v$, or of $\Delta^3 v$. The former
were investigated originally in a most valuable and elegant series
of papers by De Forest as early as 1873; De Forest’s work
(which for many years remained unknown) thus antedated in
its conception many other later contributions which still are
often credited with priority. The case of $\Delta^3 v$ was indicated
first by G. F. Hardy, then stated completely by W. F. Sheppard, and
later restated independently in different form by R. Henderson
and again by J. R. Larus. The history and details of those
investigations are set out in V:166t with an appraisal of De
Forest’s fundamentally important contributions. A summary
is given in section VII of Chapter XI here.

(xiii) The “Theory of Risk”

Although it does not seem advisable in this study to examine 
in detail the somewhat extensive mathematical and practical 
considerations involved in the so-called “Theory of Risk” 
attaching to the “grant” or sale of life annuities and insurances, 
it may be well to indicate briefly the nature of the problem, the 
manner in which it utilizes the concepts here under discussion, 
and the appropriate bibliography. 
It is important, firstly, to realize that the matter is essentially
distinct from that dealt with in reaching such formulae as those
for $\sigma^2\{a_x\}$ in (xi). The latter concern the “errors” which arise
as a result of limited data (the observed exposed to risk and
deaths), and thus seek to measure the effect of such errors upon
the computed annuity value $a_x$. The formulae of (xi) accordingly
represent the mean square error in $a_x$, i.e., the average of the
squares of the deviations expected in the observed value of $a_x$ as
deduced from a given set of exposed to risk ($E$) and deaths
($Eq'$). They consequently provide a means of testing the
permissible range of variation in the annuity values so observed,
in accordance with the principles summarized on p. 21, and
illustrated at p. 269; C; 6. For example, if we have a known and
widely accepted mortality experience, such as the O^, from which
the “true” values of q and p could be taken, and if some other
body of data had yielded at age x the observed annuity value
$a'_x$, then a deviation of $3\sigma\{a_x\}$ or more would be very unlikely
as a result of merely chance variations, and its actual occurrence
would indicate the existence in that other body of data of some
definite cause leading to so great a variation from that
experience.

The “Theory of Risk”, on the other hand, considers the
deviations encountered in the estimated value of an annuity
(or an insurance) which may be supposed to have been issued
on a specified life, or lives. The distinction between the two
problems is well drawn by Steffensen’s remark (P:l 38:280) that
the formulae such as those for $\sigma^2\{a_x\}$ in (xi) concern the origin
of the tabulated values, whereas the “theory of risk” examines
their application.

The first treatment of the fundamental formulae required
was published by Bremiker (H:4f:286) in 1871. A very
convenient restatement of Bremiker’s reasoning and conclusions
was given by G. F. Hardy (P:5f :104) in 1909, and the question
has again been examined on the same lines within the last few
years (see P:103 and P:SP). It will be sufficient here to give
the essence of the method, as shown by Hardy, for the case of
the continuous life annuity, $\bar{a}_x$.

Suppose that an annuity is granted to a person aged x at a
price $\bar{a}_x$, the annuity payments being assumed to be made
continuously throughout the year, and interest being
correspondingly based on the force of interest
$\delta\ (= \log_e(1+i) = -\log_e v)$. Then if the annuitant should die
at the end of time t, the deviation (as at the date of entry) from
the mean value $\bar{a}_x$ for which the annuity was sold to the
specified individual would be $\bar{a}_{\overline{t}|} - \bar{a}_x$; the
probability of the annuitant so dying at time t is
${}_tp_x\,\mu_{x+t}\,dt = -\frac{d}{dt}({}_tp_x)\,dt$, and therefore the
mean square deviation, being the sum, for all values of t, of the
squares of the deviations multiplied by the frequency in each
case, is

$$\int_0^\infty (\bar{a}_{\overline{t}|} - \bar{a}_x)^2\, {}_tp_x\,\mu_{x+t}\,dt.$$

Now $\bar{a}_{\overline{t}|} = \frac{1 - v^t}{\delta}$ and
$\bar{a}_x = \frac{1 - \bar{A}_x}{\delta}$, so that
$\bar{a}_{\overline{t}|} - \bar{a}_x = \frac{\bar{A}_x - v^t}{\delta}$, and the
integral becomes

$$\frac{1}{\delta^2}\left[(\bar{A}_x)^2\int_0^\infty {}_tp_x\,\mu_{x+t}\,dt - 2\bar{A}_x\int_0^\infty v^t\,{}_tp_x\,\mu_{x+t}\,dt + \int_0^\infty v^{2t}\,{}_tp_x\,\mu_{x+t}\,dt\right]$$

$$= \frac{(\bar{A}_x)^2 - 2\bar{A}_x(\bar{A}_x) + \bar{A}'_x}{\delta^2} = \frac{\bar{A}'_x - (\bar{A}_x)^2}{\delta^2},$$

where $\bar{A}'_x$ is computed at a rate of interest corresponding to
$v^2$ (that is, $\delta' = 2\delta$, or at a rate $j = (1+i)^2 - 1$). We thus
find that the standard deviation of the distribution for which
$\bar{a}_x$ is the mean value is
$\frac{1}{\delta}\sqrt{\bar{A}'_x - (\bar{A}_x)^2}$.

Illustrative numerical values are given in H:^f :289, P:f 05:243-4,
and P:50:71, which show that the standard deviation ranges
generally from about 5 at the younger ages to about 2 or 3
at age 80, and represents an increasing proportion of the mean
value $\bar{a}_x$.
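
The result may be checked numerically. The following is only a sketch under
stated assumptions: mortality is taken to follow a Gompertz law with
arbitrary illustrative constants `B` and `c` (not from any table cited in the
text), the integrals are approximated by the trapezoidal rule, and all
function and parameter names are hypothetical. It compares the mean square
deviation computed directly from its definition with the closed form
$(\bar{A}'_x - (\bar{A}_x)^2)/\delta^2$.

```python
import math


def annuity_sd(x, delta, B=5e-5, c=1.10, horizon=120.0, step=0.01):
    """Return (a_bar_x, sd computed directly, sd from the closed form).

    Assumes a Gompertz force of mortality mu_{x+t} = B * c**(x+t);
    B, c, horizon and step are illustrative choices only.
    """
    n = int(round(horizon / step))

    def density(t):
        # Probability density of death at time t: tpx * mu_{x+t}.
        mu = B * c ** (x + t)
        tpx = math.exp(-B * c ** x * (c ** t - 1.0) / math.log(c))
        return tpx * mu

    def integrate(g):
        # Composite trapezoidal rule on [0, horizon].
        total = 0.5 * (g(0.0) + g(horizon))
        for k in range(1, n):
            total += g(k * step)
        return total * step

    A = integrate(lambda t: math.exp(-delta * t) * density(t))        # at delta
    A2 = integrate(lambda t: math.exp(-2.0 * delta * t) * density(t))  # at 2*delta
    a_x = (1.0 - A) / delta

    # Mean square deviation straight from the definition.
    direct = integrate(
        lambda t: ((1.0 - math.exp(-delta * t)) / delta - a_x) ** 2 * density(t)
    )
    closed_form = (A2 - A * A) / delta ** 2
    return a_x, math.sqrt(direct), math.sqrt(closed_form)


a_x, sd_direct, sd_closed = annuity_sd(40, math.log(1.04))
print(f"a_bar = {a_x:.3f}, sd (direct) = {sd_direct:.4f}, sd (closed form) = {sd_closed:.4f}")
```

The two standard deviations agree to within the error of the numerical
integration, which is all that the sketch is intended to show.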


The comparable formulae for the standard deviations of the
distributions for which the mean values are other functions
such as $A_x$, $e_x$, $a_x$, the terminal reserves, etc., are indicated easily
in V:51 :105, and are proved also, with numerical interpretations,
in P:f05 and T?:89.


The formulae of the type just considered in effect examine
the mean square errors and standard deviations over the entire
lifetime subsequent to the age at entry x. They are consequently
to be distinguished carefully (cf. P:5^:78 and 602) from those
referring only to each year separately, which form the basis of the
discussions of the “risk” problem in numerous other papers
(such as 1^:117:296, and P:55:40). In this latter case
suppose that each of n persons effected t years ago a policy of 1
at age x; then the standard deviation in the deaths for the year
of age x+t to x+t+1 (being the policy year t+1) is, by (8),
$\sqrt{n\,p_{x+t}\,q_{x+t}}$; the net amount at risk on each policy at the
beginning of the year is $v(1 - {}_{t+1}V_x)$, where ${}_{t+1}V_x$ is the
terminal reserve at the end of the year; and consequently the
standard deviation in the amounts at risk for the year is

$$v(1 - {}_{t+1}V_x)\sqrt{n\,p_{x+t}\,q_{x+t}} = {}_tR_x \ \text{(say)} \qquad \text{(c)}$$

The obvious generalizations of this basic formula for the cases
when the amounts at risk are not the same for every individual,
or the age varies, are discussed in P:53 and P:1 17:296.

The relation between this formula for a single year and the
corresponding expression for the entire duration (see also
P:^P:602) is given by Hattendorff's Theorem (H :38), which states
that if $R_x^2$ is the mean square error in the amounts at risk for the
entire duration, and ${}_tR_x^2$ by (c) above is that for the (t+1)th
year, then

$$R_x^2 = \sum_{t=0}^{\infty} {}_tE'_x\, {}_tR_x^2$$

where ${}_tE'_x = v^{2t}\,{}_tp_x$ and is therefore a pure endowment
calculated at a rate of interest corresponding to $v^2$ (that is,
$\delta' = 2\delta$). A demonstration of this theorem, by Steffensen,
and two other proofs due to N. P. Bertelsen, are given in P:1S9:8.
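
Hattendorff's relation can be verified on a small scale. The sketch below
rests on assumptions not made in the text: a single life (n = 1), a
whole-life insurance of 1 payable at the end of the year of death, annual net
premiums payable in advance, and a short hypothetical series of mortality
rates ending with certain death. The function name and the rates are
illustrative only.

```python
def hattendorff_check(q, i):
    """Compare the mean square deviation of the loss over the whole duration
    with the sum of the discounted yearly values given by formula (c).

    q : hypothetical yearly mortality rates q_x, q_{x+1}, ...; the last must be 1.
    i : annual rate of interest.
    """
    v = 1.0 / (1.0 + i)
    d = 1.0 - v
    m = len(q)

    # Annuity-due values by backward recursion; zero once no one survives.
    a_due = [0.0] * (m + 1)
    for t in range(m - 1, -1, -1):
        a_due[t] = 1.0 + v * (1.0 - q[t]) * a_due[t + 1]

    premium = 1.0 / a_due[0] - d                       # annual net premium
    reserve = [1.0 - a_due[t] / a_due[0] for t in range(m + 1)]

    whole_duration = 0.0
    yearly_sum = 0.0
    tpx = 1.0
    for t in range(m):
        p_t, q_t = 1.0 - q[t], q[t]
        # Loss at issue if death occurs in policy year t+1 (its mean is zero).
        loss = v ** (t + 1) - premium * (1.0 - v ** (t + 1)) / d
        whole_duration += tpx * q_t * loss ** 2
        # Formula (c) squared, with n = 1, discounted by v^{2t} * tpx.
        r_squared = (v * (1.0 - reserve[t + 1])) ** 2 * p_t * q_t
        yearly_sum += v ** (2 * t) * tpx * r_squared
        tpx *= p_t
    return whole_duration, yearly_sum


hypothetical_q = [0.01, 0.02, 0.05, 0.10, 0.30, 1.0]
total, by_years = hattendorff_check(hypothetical_q, 0.04)
print(f"whole duration: {total:.10f}   sum of yearly terms: {by_years:.10f}")
```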


C ; 8. The Standard Deviation of the Arithmetic Mean 

Particularly in view of certain later discussions in Chapter V, 
it will be well for the student to be quite clear as to the meaning 
of this result. A single quantity is supposed to be under obser- 
vation; r independent and unbiassed observations are made; the 
arithmetic mean of the results is taken; and it is assumed that $\sigma^2$
is known.

For example, imagine a large homogeneous group of n persons
(all of the same age, sex, habitat, etc.), who may properly,
within practicable limits, be considered to represent a “random
sample” (see Chapter V); and suppose that the number of deaths
(within a year, say) is the “single observed quantity”. Then
if $s_1$ deaths actually occur (giving an observed statistical
frequency of deaths of $\frac{s_1}{n}$), the mean square error, by (7),
is npq, where p and q are the “true” probabilities of survival and
death. Now if another similar group, of precisely the same size n, is
again observed, and $s_2$ deaths are found (with an observed
statistical frequency of $\frac{s_2}{n}$), the mean square error is
again npq. Suppose, therefore, that in r such similar groups, each
of size n, the deaths observed quite independently are
$s_1, s_2, \ldots, s_r$. The mean square error for every “observation”
will be npq; the arithmetic mean of the r independent
determinations is $\frac{s_1 + s_2 + \ldots + s_r}{r}$; then by (30) the
mean square error of that result is $\frac{npq}{r}$.

That is, $\sigma\{A.M.\} = \sqrt{\frac{npq}{r}}$.
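
The meaning of this result can be confirmed exactly for a small case. The
sketch below (function names and the values n = 20, q = .1, r = 4 are
illustrative assumptions) builds the distribution of the total deaths in r
independent groups by repeated convolution of the binomial distribution, and
computes the mean square error of the arithmetic mean from that distribution.

```python
from math import comb


def binomial_pmf(n, q):
    """Probabilities of 0, 1, ..., n deaths among n lives, each with rate q."""
    return [comb(n, s) * q ** s * (1.0 - q) ** (n - s) for s in range(n + 1)]


def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def mse_of_mean(n, q, r):
    """Mean square error of the arithmetic mean of r independent group counts."""
    single = binomial_pmf(n, q)
    total = single
    for _ in range(r - 1):
        total = convolve(total, single)
    mean = sum(s * w for s, w in enumerate(total)) / r
    return sum(w * (s / r - mean) ** 2 for s, w in enumerate(total))


n, q, r = 20, 0.1, 4          # illustrative values only
print(mse_of_mean(n, q, r), n * q * (1.0 - q) / r)
```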



C; 9. Numerical Applications of $\sigma^2$ {Ratios} to Actuarial Data

(1) Suppose that in a homogeneous sample group of 1000 
men, 65% are eligible to qualify for admission under the rules 
of a specified pension plan ; what are the limits of variation thence 
to be anticipated in the “universe” of men from which the sample 
was drawn? 

By formula (8), $\sigma = \sqrt{1000pq}$, where p is the true percentage
of eligibles in the “universe” or “parent population”. Although
p is thus not known, pq cannot exceed $(\frac{1}{2})(\frac{1}{2})$, or
$\frac{1}{4}$, so that $\sigma$ cannot exceed $\sqrt{1000(\frac{1}{4})}$, or 16.
Using now the principle that $\pm3\sigma$ embraces over 99% of the
deviations (see p. 20), it follows that the limits of variation may
be put at ±3(16) = ±48, which is ±4.8% of the sample of 1000.
Hence the percentage of eligibles in the parent population would
be between 65% ±4.8%, i.e., between 60.2% and 69.8%.

In view of the meaning of Bernoulli’s Theorem concerning
the approach of the observed statistical frequency to the
true probability as n increases (see p. 187; B; 2), an
approximation of a somewhat closer kind would be obtainable in such
a case as this, where p and q are unknown, by taking the
observed statistical frequency, .65, as an estimate of p. Without
at the moment discussing the justifiability of this substitution
except to remark upon its clearly reasonable nature (see p. 292;
C; 10), we should then find that
$\sigma \approx \sqrt{1000(.65)(.35)} = 15.08$, so that
$\pm3\sigma \approx \pm45.24$, and the percentage of eligibles in the
parent population would lie between 65%±4.5%, i.e., 60.5%
and 69.5%, which differs only slightly from that previously
obtained.

(2) Suppose that in the above example it were known (or
could be confidently assumed) that 65% is the “true” value of
the percentage of eligibility, and that a comparable random
sample of 1500 cases had been taken some years later, which
showed 1100 eligibles. Can this be attributed to chance
fluctuations alone? In a group of 1500 the expected number of
eligibles = 1500(.65) = 975; $\sigma = \sqrt{1500(.65)(.35)} = 18.47$; the
$\pm3\sigma$ limits are therefore ±55.41, so that in a sample of 1500 the
eligibles would lie between 975±55.41, i.e., between say 920
and 1030, if the fluctuations were due solely to chance. The
occurrence of so many as 1100 eligibles therefore suggests that
some specific cause must be sought as the reason for so great a
variation.
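
The arithmetic of examples (1) and (2) may be retraced as follows. This is
merely a sketch; the function name is hypothetical, and the figures are those
stated in the two examples.

```python
import math


def three_sigma_limits(n, p):
    """Standard deviation of the number of successes in n trials, and 3 * sigma."""
    sigma = math.sqrt(n * p * (1.0 - p))
    return sigma, 3.0 * sigma


# Example (1): upper bound with pq = 1/4, then the estimate p = .65.
sigma_max, limit_max = three_sigma_limits(1000, 0.5)
sigma_est, limit_est = three_sigma_limits(1000, 0.65)
print(f"(1) sigma cannot exceed {sigma_max:.2f} (taken as 16 in the text)")
print(f"(1) with p = .65: sigma = {sigma_est:.2f}, 3 sigma = {limit_est:.2f}")

# Example (2): 1100 eligibles observed in 1500 when the true value is 65%.
sigma_2, limit_2 = three_sigma_limits(1500, 0.65)
expected = 1500 * 0.65
outside = abs(1100 - expected) > limit_2
print(f"(2) expected = {expected:.0f}, sigma = {sigma_2:.2f}, "
      f"limits = {expected - limit_2:.1f} to {expected + limit_2:.1f}, "
      f"1100 outside limits: {outside}")
```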

(3) Again assuming 65% as the “true” percentage of
eligibility, suppose that the 1500 sample showed 1100 eligibles
(i.e., 73%), while a second sample of 2000 gave 1200 eligibles
(i.e., only 60%). Can the difference shown have arisen solely
from chance variations? Here formula (31) is directly
applicable, giving $\sigma^2$ for the difference between the observed
proportions (73% and 60%) as
$(.65)(.35)\left(\frac{1}{1500} + \frac{1}{2000}\right)$, of which
$\sigma = .01629$. Hence $\pm3\sigma$ for the difference is ±.048. The
difference between the proportions is .13, which lies far outside
the $\pm3\sigma$ limits; consequently it is to be concluded that some
influence other than chance fluctuations has operated to cause
the change between the 60% and 73% in the two samples.

While this rapid method of applying the $\pm3\sigma$ rule is very
generally adopted, and indicates the conclusion clearly in such
a case as this, the probability itself can of course be calculated
directly from the tables of the probability integral very easily.
For it has been shown in (26) that the Normal Curve is applicable
to the case of the sum of two independent quantities, and it
follows from its extension to (27) that the normal law also holds
in the more general linear compound case, of which this problem
is an instance. Now (see p. 162; A; 5) the probability, under
the normal law, of an “error” of $\pm d$ or less is
$\mathrm{Erf}\left(\frac{d}{\sigma\sqrt{2}}\right)$, and the probability of
an error as much as, or more than, d is therefore
$1 - \mathrm{Erf}\left(\frac{d}{\sigma\sqrt{2}}\right)$. Here d = .13, and
$\sigma = .01629$, whence $\frac{d}{\sigma\sqrt{2}} = 5.64$. The probability
that the chance distribution of errors of the normal law would give
an error so great numerically as .13 when $\sigma = .01629$ is therefore
$1 - \mathrm{Erf}(5.64)$, or practically zero. That is to say, so great a
difference as that between the observed 73% and 60% cannot have
occurred by chance; some specific cause must have influenced the
change.

(4) Since the above procedure may, with proper reservations, 
be applied to the mortality rates of different samples or groups, 
it may be well to give another example (from P:7^:270, slightly 
modified). 

Suppose that at a certain age in a particular district one
group of 224,728 males exhibits a probability of death ($q'_1$) of
.00486, while another group of 244,906 males gives $q'_2$ as .00420
(the data having been collected with as much regard as is
practicable for the homogeneity requirements of simple sampling);
it is known that the “true” rate of mortality for males of that
age in that district can be taken as .00453; does the difference
of .00066 in the death rates of the two groups indicate a real
difference in the mortality, or might it have arisen from purely
chance variations? By formula (31), $\sigma$ for the difference
in the two observed rates of mortality is

$$\left[(.00453)(.99547)\left(\frac{1}{224728} + \frac{1}{244906}\right)\right]^{\frac{1}{2}} = .000196.$$

Consequently the probability of a difference (an “error”) as great
as .00066 is $1 - \mathrm{Erf}\left(\frac{d}{\sigma\sqrt{2}}\right)$ where
d = .00066 and $\sigma = .000196$, giving
$1 - \mathrm{Erf}(2.38) = 1 - .9992 = .0008$, so that the probability is
almost negligible that the difference observed can have been due to
chance, i.e., there is a real difference in the mortalities of the two
groups.

Using the $\pm3\sigma$ rule, we see that $\pm3\sigma = \pm.00059$; the
observed difference of .00066 again lies outside that range, and
hence is not attributable to mere chance. In this case, however,
the indication given by the probability itself is clearer on account
of the order of magnitude of the figures involved.

In examples (3) and (4) it has been supposed that the “true”
p and q are definitely known. If this is not so, however (and
in actuarial problems it usually is not so), it would be necessary
to form an estimate of their values by reference to previous
experience, or to use the evidence of the data alone by resorting
to formula (32) as an approximation (see p. 292; C; 10).


C; 10. On Certain Approximations for npq when the “True”
p and q are not known

Throughout this study it has been emphasized that the
primary objective has been to explore, mathematically and
statistically, the deviations which may be expected to occur, by
pure chance, between an observed statistical frequency,
$\frac{s}{n}$, of s successes in n trials, and the true a priori
probability, p. The fundamental thought underlying Bernoulli's
theorem is the approach of $\frac{s}{n}$ to p as n is increased
indefinitely.

Now it has been seen that the various formulae involving
the probabilities of these deviations require that the true
probabilities, p and q, be known. The fundamental mean square
error, for example, of the np = s successes in n trials, as given
by npq in formula (7), means that in a series of independent but
similar observations, based successively on
$n_1, n_2, \ldots, n_\nu$ cases, the respective values of $\sigma^2$ would
be $n_1pq, n_2pq, \ldots, n_\nu pq$; for the basic conditions are
unchanged, so that p and q remain the same.

In some problems these true values of p and q may be known. 
But if they are not, it is obviously necessary in many cases, if a 
numerical solution is to be reached at all, to form some estimate 
of their values. If there is no outside source of information 
which can be used, it may be necessary to employ a value based 
on the observed values as an estimate of the true p — a procedure 
which clearly would become less and less of an approximation 
as n increases. 

In the practical application of these methods to actuarial
problems the true p and q are seldom known. If, consequently,
it becomes essential to base an estimate on the observed values,
the immediately obvious method would be to assume that, if n
is large, the observed statistical frequency may be employed in
each case, so that the above series would be taken as
$n_1p'_1q'_1, n_2p'_2q'_2, \ldots, n_\nu p'_\nu q'_\nu$ where
$p'_y = \frac{s_y}{n_y}$ $(y = 1, 2, \ldots, \nu)$ (cf. F:116).

This method of approximation, however, must be used with
considerable care. Yule and Kendall express the following
view: “Precisely how large n must be for this approximation
to be valid it is not easy to say. Samples of 1000 are almost
certainly large enough, and we may often apply the foregoing
procedure to much smaller samples, say of 100. For samples
below that figure it is well to examine carefully the circumstances
of any given case and to proceed with caution” (P:f77:355).

When we are dealing with more than one sample it is clear
that the estimate of p would be based on a larger body of data if,
instead of taking $p'_y = \frac{s_y}{n_y}$ $(y = 1, 2, \ldots, \nu)$ from
each separate sample, the observed values were all combined as
$\frac{s_1 + s_2 + \ldots + s_\nu}{n_1 + n_2 + \ldots + n_\nu}$.
This is, of course, the method shown by formula (32) for two
samples only.

In the case of mortality statistics, for which at many ages
q is small, it is frequently suggested (as in P-A15) that an
approximation may be used by first substituting the observed
values $p'_1, p'_2, \ldots, p'_\nu$ in each case for p (and
correspondingly for q), as noted above, and then taking
$p'_1 = p'_2 = \ldots = p'_\nu \approx 1$. If this is legitimate, then the
values of $\sigma^2$ for each of the $\nu$ independent samples of
$n_1, n_2, \ldots, n_\nu$, which strictly are
$n_1pq, n_2pq, \ldots, n_\nu pq$, become first
$n_1p'_1q'_1, n_2p'_2q'_2, \ldots, n_\nu p'_\nu q'_\nu$, and then
$n_1q'_1, n_2q'_2, \ldots, n_\nu q'_\nu$. The $\sigma^2$ of their sum, by
(27), would be $n_1q'_1 + n_2q'_2 + \ldots + n_\nu q'_\nu$. But
$n_yq'_y = d_y$, where $d_y$ represents the observed deaths of sample
y, so that $\Sigma n_yq'_y$ is the total of all the observed deaths.
The approximation is thus reached that $\sigma^2$ for the deaths in
such a series of mortality statistics might be taken as merely the
total of the actual deaths, a conclusion which the student will
often find stated in the form
$\sigma \approx \sqrt{\text{Actual Deaths}}$.

The fact that this approximation must be applied with
circumspection can hardly require emphasis. From its nature
it evidently results from several assumptions which may, under
particular circumstances, be far from close approximations.
For, like the original npq of (7), its basis is questionable if q
is so small (i.e., p near enough to 1) that nq is small with a large n;
the substitution of the individual sample values
$p'_1, p'_2, \ldots, p'_\nu$ (and $q'_1, q'_2, \ldots, q'_\nu$) for the true
p (and q) may be unjustifiable if $n_1, n_2, \ldots, n_\nu$ are not
large; the assumption that $p'_1 = p'_2 = \ldots = p'_\nu$
must be examined carefully; and the further supposition that
each $p'_y$ (or even $\sqrt{p'_y}$) can be taken as 1 also imagines
$q'_y$ to be so small that again the validity of the whole basis may
be questioned unless each $n_yq'_y$ is perhaps 10 or over when n is
not also small. These are the theoretical reservations; in any
practical case they should be given full consideration, and the
complete formula resulting from the use of a “true” p (and q) for
each $n_1, n_2, \ldots, n_\nu$ should be tested against the
$\sigma \approx \sqrt{\text{actual deaths}}$ approximation
before the latter should be adopted.
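
The test recommended above may be sketched as follows. The three samples,
their deaths, and the “true” rate of .01 are hypothetical figures invented
solely for illustration; the function name and the labels of the results are
likewise assumptions.

```python
import math


def sigma_of_total_deaths(exposed, deaths, true_q=None):
    """Standard deviation of the total deaths on the successive bases
    discussed in the text."""
    total_n, total_d = sum(exposed), sum(deaths)
    results = {}
    if true_q is not None:
        # Strict basis: the sum of n_y * p * q with the true p and q.
        results["true p and q"] = math.sqrt(total_n * true_q * (1.0 - true_q))
    # Each sample's own observed rates: n_y p'_y q'_y = d_y (1 - d_y / n_y).
    results["separate observed rates"] = math.sqrt(
        sum(d * (1.0 - d / n) for n, d in zip(exposed, deaths))
    )
    # Observed values combined, as in formula (32).
    pooled_q = total_d / total_n
    results["combined observed rate"] = math.sqrt(total_n * pooled_q * (1.0 - pooled_q))
    # Taking every p' as 1.
    results["sqrt(actual deaths)"] = math.sqrt(total_d)
    return results


# Hypothetical data, for illustration only.
exposed = [5000, 8000, 12000]
deaths = [48, 85, 118]
comparison = sigma_of_total_deaths(exposed, deaths, true_q=0.01)
for basis, value in comparison.items():
    print(f"{basis:>24}: {value:.3f}")
```

With rates of mortality as small as these the several bases differ by less
than one per cent; at ages where q is not small the last approximation would
be visibly too large.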

C; 11. Applications of the Lexis Theory

Experimental Verifications 

Examples of actual drawings of cards, and balls from urns,
in conformity with these Bernoulli, Poisson, and Lexis methods
of sampling are given in P:5^:136-145 and P:ii>^:84-7, which
show close agreement between the results derived from the
actual experiments and the theoretical values by the formulae.

Hypothetical Illustration of the Three Types 

The following very simple hypothetical example, constructed 
by Forsyth (P:-^4:195), illustrates the practical application of 
the method. Suppose that the numbers of deaths in 10 districts 
in each of three countries (the populations of each district being 
assumed to be 1,000), are as given below: 


Actual Deaths per 1,000 Population ($D_r$)
in 10 Districts (r) in Three Countries (A, B, and C)

District (r)    Country (A)    Country (B)    Country (C)
      1              17             18             24
      2              16             17             22
      3              11             10             20
      4              12             11             19
      5              13              9             18
      6              14             12             10
      7              15             19              9
      8              16             16              8
      9              14             10              6
     10              12             18              4
   Totals           140            140            140

The average number of deaths in all three countries is 14 per
district. Assuming for the moment that we may here use the
observed instead of the unknown true values (which strictly
should be used in conformity with the analyses in B; 9) we
therefore have np = 14; p = .014; q = .986; and
$\sigma_B = \sqrt{14(.986)} = 3.72$, which would be the standard
deviation if the probability of death were constant throughout.

The actual values of $\sigma$, however, are now computed directly
from the data. For Country (C), for example, we have

   $|D_r - np|$      $|D_r - np|^2$
       10               100
        8                64
        6                36
        5                25
        4                16
        4                16
        5                25
        6                36
        8                64
       10               100
                        ---
                        482

whence $\sigma^2 = \frac{482}{10} = 48.2$, or $\sigma = 6.94$.

We thus find by actual calculation from the data that

For Country (A), $\sigma = 1.90$, and $L = \frac{1.90}{3.72} = .51$

For Country (B), $\sigma = 3.74$, and $L = \frac{3.74}{3.72} = 1.01$

For Country (C), $\sigma = 6.94$, and $L = \frac{6.94}{3.72} = 1.87$

Remembering that here $\sigma_B = 3.72$, and that an actual value less
than $\sigma_B$ indicates variation of the Poisson type within the sample,
while a greater value denotes variation of the Lexis type from
sample to sample, these results suggest the conclusion, on the
Lexis theory, that greater variation occurred within each district
of Country (A) than from district to district; that variation was
about the same throughout (B); and that there was greater
variation from district to district in (C) than within each district.

This example was originally given, and is here reproduced,
only as a simple illustration of the Lexis method of approach.
Students should note carefully, however, that the observed data
are used therein to give estimates of the “true” values which
strictly are required by the analysis of B; 9; rigorously,
therefore, Bessel’s correction (42) should be introduced in such
cases by multiplying (41) and its results by the appropriate factor,
as indicated in P:^S:508, or by adopting an even more refined
adjustment due to Tschuprow which is demonstrated and
illustrated in :216-222.
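
The three Lexis ratios of the hypothetical illustration may be recomputed
from the table of deaths. The sketch below follows the method of the example
(observed values used in place of the true values, and no Bessel correction);
the function name is hypothetical.

```python
import math

# Deaths per 1,000 population in the 10 districts of each country (from the table).
deaths = {
    "A": [17, 16, 11, 12, 13, 14, 15, 16, 14, 12],
    "B": [18, 17, 10, 11, 9, 12, 19, 16, 10, 18],
    "C": [24, 22, 20, 19, 18, 10, 9, 8, 6, 4],
}


def lexis_ratio(district_deaths, population=1000):
    """Return (sigma from the data, Bernoullian sigma_B, Lexis ratio L)."""
    k = len(district_deaths)
    mean = sum(district_deaths) / k                 # np
    p = mean / population
    sigma_b = math.sqrt(population * p * (1.0 - p))
    sigma = math.sqrt(sum((d - mean) ** 2 for d in district_deaths) / k)
    return sigma, sigma_b, sigma / sigma_b


for country, d in deaths.items():
    sigma, sigma_b, L = lexis_ratio(d)
    print(f"Country ({country}): sigma = {sigma:.2f}, sigma_B = {sigma_b:.2f}, L = {L:.2f}")
```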

(a) Practical Illustration of Bernoulli Type 

An application of the Lexis theory under conditions which
were found to be Bernoullian has been given by the Russian
Jastremsky in connection with the Austro-Hungarian
mortality investigation of 1876-1900 (P:54:135). If $q_{[x-t]+t}$
denotes the “select” rate of mortality at attained age x for the
year of duration t since entry, and $q_x$ the “ultimate” rate at
attained age x based on the data of 5 or more years of duration
since entry, the ratio $\frac{q_{[x-t]+t}}{q_x}$ for values of t from 0
to 4 will indicate the effect of duration upon the rate of mortality
at attained age x, being < 1 if the effects of “selection” at entry
are still apparent, and equalling 1 if the mortality is no longer
dependent on duration. If now, for each value of t from 0 to 4, this
ratio is computed for the various values of the attained age x, five
series are obtained (one for each t from 0 to 4), each of which
will be constant as x varies if the ratio is independent of x, but
will vary as x varies if the ratio depends on x. In order to
examine whether the series of ratios (for, say, any one of the
values of t) could be treated as being independent of x,
Jastremsky in effect computed $\sigma$ (about its average value for all
ages) from the actual series of ratios, and also the Bernoullian
$\sigma_B$ on the hypothesis of independence, so that the Lexis ratio
$L = \frac{\sigma}{\sigma_B}$ would test the admissibility of that
hypothesis. For endowment assurances, for example, the values of L
for t = 0, 1, 2, 3, and 4 were found to be 1.01, .96, 1.05, .98, and
.91, and were thus all close to 1. The inference to be drawn therefore
was that the Bernoullian hypothesis was plausible for each value
of t, so that the observed fluctuations in the ratios by age could
be attributed to chance alone, and $\frac{q_{[x-t]+t}}{q_x}$ for each
value of t might reasonably be treated as being independent of the
attained age x (see P:5^:166).


(b) Practical Illustration of Poisson Type

The Poisson type of variation from group to group within
the “universe” is encountered directly in dealing with mortality
statistics. Yule and Kendall, for example (P:^ 77:366) consider
the following as an illustration: In a population of n persons,
all of one sex and one age, with a rate of mortality of 12 per 1,000
(being .012) throughout, thus conforming with the Bernoullian
conditions of simple sampling, $\sigma_B$ in the death rate, being
$\sqrt{\frac{pq}{n}}$ by (36), is
$\sqrt{\frac{(.012)(.988)}{n}} = \frac{109}{1000\sqrt{n}}$. If, however, a
population (still composed of one sex only) had the same average
(crude) rate of mortality (.012) throughout, but now were to
include various age groups with different rates of mortality, such
as about .064 in infancy, decreasing to about .0025 in childhood,
and thence continuously increasing until old age, the standard
deviation, $\sigma_p$, of the death rate about its mean within such a
population would be about .024; hence, from (37), $\sigma_P$, the
Poisson type standard deviation, would be

$$\frac{1}{\sqrt{n}}\sqrt{(.012)(.988) - (.024)^2} = \frac{106}{1000\sqrt{n}}.$$

The small difference between this value and the standard
deviation of simple sampling, namely, $\frac{109}{1000\sqrt{n}}$,
consequently indicates that the effect of the variation among the
individuals within the population of a country under such practical
mortality conditions is not likely to be serious.

Similar calculations may be made to test the importance of 
non-homogeneity in any group of lives to which a single mortality 
rate is applied, or from which such a mortality rate is computed. 
It is interesting to note that non-homogeneity actually decreases 
the standard deviation in comparison with that of a homogeneous 
group. 
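
The two coefficients quoted in this illustration follow directly from the
figures stated. In the sketch below the function name is hypothetical; the
values returned are the numerators of the standard deviations, each of which
is to be divided by $\sqrt{n}$.

```python
import math


def rate_sd_numerators(mean_rate, sd_of_rates):
    """Numerators (to be divided by sqrt(n)) of the standard deviation of the
    death rate under simple sampling and under Poisson type variation."""
    bernoulli = math.sqrt(mean_rate * (1.0 - mean_rate))
    poisson = math.sqrt(mean_rate * (1.0 - mean_rate) - sd_of_rates ** 2)
    return bernoulli, poisson


bernoulli, poisson = rate_sd_numerators(0.012, 0.024)
print(f"simple sampling: {1000 * bernoulli:.1f} / (1000 sqrt n)")
print(f"Poisson type:    {1000 * poisson:.1f} / (1000 sqrt n)")
```

The second value is necessarily the smaller, which is the decrease of the
standard deviation through non-homogeneity remarked upon above.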
(c) Practical Illustration of Lexis Type

Since usually the populations will not really all be equal (as 
was assumed in the demonstration of the formulae), it is desirable 
in practice to employ a technique which will give due weight to 
such variations in size. A mathematical examination of the 
corrections theoretically necessary, with a numerical example, 
has been given by Arne Fisher in P:5^:157-160. For practical 
purposes in connection with mortality data, however, it will 
generally be sufficient to use the groups of varying size as if they 
were all of the same size, except that due allowance should be 
given for the varying size in determining the basic rates of 
mortality involved. The point may be illustrated from the 
following example given by Rietz (P:fi4:89) for the rate of 
mortality in the first year of age in nine states of the United 
States: 


   State        Number Exposed       Death Rate
                   to Risk            per 1000
    (1)              (2)                 (3)
     1              50,707                70
     2              33,370                85
     3              57,915                78
     4              35,392                68
     5              53,658                77
     6              51,452                66
     7              51,832                74
     8              41,656                78
     9              54,472                79
                   -------
                   430,454


The simple arithmetic mean of the nine death rates is 75 per
1,000. If they had all arisen from groups of equal size their
$\sigma^2$, computed directly by using only column (3) above, would be
$\frac{1}{9}\left[(70-75)^2 + (85-75)^2 + \ldots + (79-75)^2\right] = 32.67$,
whence $\sigma = 5.72$.

One obvious interpretation of the assumption that they had
arisen from groups of equal size would be to suppose that in each
state the same death rates per 1,000 would have been shown if
in fact the populations had all been equal, in which case the
equal populations could clearly be taken as merely the average
$\frac{430,454}{9} = 47,828$. Under these circumstances the Bernoullian
$\sigma_B$ for the rate of mortality, being $\sqrt{\frac{pq}{n}}$ by (36),
would have been $\sqrt{\frac{(.075)(.925)}{47828}} = .00120$ per person
= 1.20 per 1,000. The Lexis Ratio is therefore
$\frac{5.72}{1.20} = 4.77$, and Charlier's Coefficient of Disturbancy is
$\frac{100\sqrt{(5.72)^2 - (1.20)^2}}{75} = 7.45$. The data are thus
shown to be “hypernormal”, of the Lexis type, indicating that
the rates of mortality vary significantly from state to state.

It is clearly preferable, however, to give due weight to the
varying sizes of the populations, which may be done as follows.
The weighted average death rate per 1,000, which allows properly
for the variations in size, is

$$\frac{70(50707) + 85(33370) + \ldots + 79(54472)}{430454} = 74.864.$$

We now compute, from the data, the $\sigma^2$ for the rate with reference
to this mean instead of to the simple average 75 previously used,
which, by the rules given at p. 254; B; 27 is

$$\frac{50707(70-75)^2 + 33370(85-75)^2 + \ldots + 54472(79-75)^2}{430454}$$

less $(75 - 74.864)^2$, whence $\sigma = 5.40$. But if the conditions had
been Bernoullian, groups of equal size, viz., 47828 for each state as
before, would have shown the weighted average death rate of
74.864; under such circumstances $\sigma_B$ would have been
$\sqrt{\frac{(.074864)(.925136)}{47828}}$, or 1.203 per 1,000.
The Lexis Ratio is consequently $\frac{5.40}{1.203} = 4.49$, and
Charlier’s Coefficient of Disturbancy is again positive. The
significant variation from state to state, which was found by the
first simpler method, is confirmed.
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In these examples, again (as noted on p. 295), a correction of 
Bessel’s type should strictly be introduced — cf. the similar illus- 
trations, with Bessel’s correction, in H:7^S(English translation): 
217-220. 
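
The arithmetic of the two Lexis calculations above may be checked mechanically. What follows is only a sketch in Python; the function names are illustrative and are not taken from the text. It uses the summary figures quoted above (not the full table of state rates) and carries the Bernoullian value unrounded, so the ratios may differ in the last digit from those obtained in the text with the rounded 1.20.

```python
import math

def bernoullian_sigma_per_1000(rate_per_1000, group_size):
    """Bernoullian standard deviation sqrt(pq/n), expressed per 1,000."""
    p = rate_per_1000 / 1000.0
    return 1000.0 * math.sqrt(p * (1.0 - p) / group_size)

def lexis_ratio(sigma_observed, sigma_bernoullian):
    """Observed standard deviation divided by the Bernoullian value."""
    return sigma_observed / sigma_bernoullian

def charlier_coefficient(sigma_observed, sigma_bernoullian, mean_rate):
    """100 * sqrt(sigma^2 - sigma_B^2) / mean; real only when sigma exceeds sigma_B."""
    return 100.0 * math.sqrt(sigma_observed**2 - sigma_bernoullian**2) / mean_rate

group_size = 47828  # 430,454 / 9, as in the text

# Simple (unweighted) method: mean 75 per 1,000, observed sigma 5.72.
sigma_b_simple = bernoullian_sigma_per_1000(75.0, group_size)
lexis_simple = lexis_ratio(5.72, sigma_b_simple)
charlier_simple = charlier_coefficient(5.72, sigma_b_simple, 75.0)

# Weighted method: mean 74.864 per 1,000, observed sigma 5.40.
sigma_b_weighted = bernoullian_sigma_per_1000(74.864, group_size)
lexis_weighted = lexis_ratio(5.40, sigma_b_weighted)

print(round(sigma_b_simple, 3), round(lexis_simple, 2), round(charlier_simple, 2))
print(round(sigma_b_weighted, 3), round(lexis_weighted, 2))
```

A Lexis ratio well above 1 on either basis corresponds to the “hypernormal” dispersion described in the text.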

Many other examples of the analysis of birth, death, and 
marriage rates for various localities or periods by this procedure 
are to be found in the literature (P:ii^:153; P:5^:151-165; and 
P:;^7:320 and 330). The actuary, however, with an innate 
distrust of supposed homogeneity, will of course in practice 
examine closely every such series not only by locality and period, 
but also according to age, sex, occupation, and any other charac- 
teristics which may have influenced the data. He will not defer 
detailed analysis until he is warned by the theory of the Lexis 
ratio. That theory, nevertheless, while thus of limited applica- 
bility in the actuary’s practicing equipment, is fundamentally 
important in the development of the underlying theories of 
Mathematical Statistics. 


C ; 12. The Standard Error of the Mean 

(1) From the results of Weldon’s experiment in casting 5 or 6
in 26,306 throws of dice, as tabulated at p. 334; C; 25, it may
easily be calculated that the mean of the distribution on the
hypothesis of unbiassed dice is 4, and for the observed series is
4.0524. The standard deviation computed from the observed
frequencies is $\sqrt{2.69826}$. The standard error of the mean
(which is based on the 26,306 values) is therefore

$\frac{\sqrt{2.69826}}{\sqrt{26306}} = .01013$.

This result may now be used, instead of the method of
C; 25, to test the admissibility of the hypothesis that the dice
were unbiassed. For the deviation between the hypothetical
and observed means is .0524; this, however, is over 5 times the
standard error of the mean (.01013); and as it consequently
exceeds the limit of twice the standard error which is generally
adopted (in conformity with the principles of Chapter III)
as a reasonable test, and even exceeds 3 times the standard error
which is usually regarded as a conclusive test, the inference to
be drawn is that the observed values are not compatible with
the hypothesis that the dice were unbiassed.

(2) The use of the rule that, on the assumption of normality, 
±3 times the standard error will define the limits within which a 
sample value of a parameter may be expected to lie may be 
illustrated by considering a sample distribution of the heights of 
1,000 men, with an arithmetic mean of 5'7" and a standard 
deviation (computed from the sample) of 2", from which the
standard error of the mean is seen to be $\frac{2}{\sqrt{1000}} = .063$. Then it
is practically certain that the true mean will lie within the range
$67 \pm 3(.063)$ inches, i.e., between 67.19" and 66.81".

(3) As another illustration Elderton suggests, in P:5^:191,
that the formula for the standard error of the mean could be 
applied similarly to examine the average profit from various 
classes of business for a series of years, and thence to determine 
whether some particular average profit should be attributed to 
chance alone. The standard error of the profits in the various 
years would be computed as the standard deviation of the 
observed series, divided by the square root of the number of 
years. 
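
Illustrations (1) and (2) above both reduce to dividing a standard deviation by the square root of the number of observations. A minimal Python sketch, using only the figures quoted in the text (the function name is illustrative):

```python
import math

def standard_error_of_mean(standard_deviation, n):
    """Standard deviation of the observations divided by sqrt(n)."""
    return standard_deviation / math.sqrt(n)

# (1) Weldon's dice: observed mean 4.0524 against the hypothetical mean 4.
se_dice = standard_error_of_mean(math.sqrt(2.69826), 26306)
multiple = (4.0524 - 4.0) / se_dice  # deviation in units of the standard error

# (2) Heights of 1,000 men: mean 67 inches, sample standard deviation 2 inches.
se_height = standard_error_of_mean(2.0, 1000)
lower, upper = 67.0 - 3 * se_height, 67.0 + 3 * se_height

print(round(se_dice, 5), round(multiple, 2))
print(round(se_height, 3), round(lower, 2), round(upper, 2))
```

The multiple found for the dice is the figure that the text describes as “over 5 times the standard error”.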


C; 13. Illustrations of “Student’s” Distribution

(1) The following example of the application of formula (45) 
is given by Deming and Birge in P:;^P:138. Suppose that 10 
equally reliable readings on a micrometer (so taken that the 
assumptions of a normal universe and randomness are satisfied) 
show the values 1.078, 1.080, 1.071, 1.076, 1.081, 1.077, 1.075, 
1.073, 1.079, and 1.070, with mean $\bar{x} = 1.0760$ and standard
deviation $\sigma_s = .00355$; examine the hypothesis that the true mean, $m$,
of the parent population from which the sample was drawn is
1.0740. Here $\bar{x} - m = .002$, and “Student’s” $z = \frac{\bar{x} - m}{\sigma_s} = \frac{.002}{.00355} = .563$;
$n = 10$; and $P_z$ in (45) (which is read easily from Nekrassoff’s
nomograph) is .13. That is to say, in about 1 in 8 samples of 10
we should expect to find $z$ in absolute value as large as or larger
than .563 (and in about 1 in 16 samples $z$ would be as large as or
larger than +.563). On the assumption that $\sigma_s = .00355$ is not
unusual, the inference to be drawn would therefore be that there 
is no strong reason to reject the hypothesis that the population 
mean, m, is 1.0740. 
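
The quantities in example (1) can be reproduced from the ten readings. In the sketch below (Python; the helper names are illustrative) the probability is not read from a nomograph but is obtained, as an assumption of this sketch, by numerically integrating the density of $t = z\sqrt{n-1}$ with $n - 1$ degrees of freedom, the same conversion from $z$ to $t$ that the text uses in example (3).

```python
import math

readings = [1.078, 1.080, 1.071, 1.076, 1.081,
            1.077, 1.075, 1.073, 1.079, 1.070]
m = 1.0740  # universe mean under the hypothesis being examined

n = len(readings)
mean = sum(readings) / n
# Sample standard deviation with divisor n, which reproduces the .00355 of the text.
sigma_s = math.sqrt(sum((x - mean) ** 2 for x in readings) / n)
z = (mean - m) / sigma_s
t = z * math.sqrt(n - 1)

def t_density(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_probability(t_value, df, steps=2000):
    """Probability of |t| at least as large as |t_value| (Simpson's rule, even steps)."""
    b = abs(t_value)
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return 1 - central_area

p_z = two_sided_probability(t, n - 1)
print(round(mean, 4), round(sigma_s, 5), round(z, 3), round(p_z, 3))
```

The computed probability is close to 1 in 8, in agreement with the reading of about .13 quoted in the text.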

As pointed out in the text on p. 46 here, it is important to 
realize that the validity of this conclusion depends on the
supposition that the value of $\sigma_s$, namely, .00355, is not unusual. This
reservation is overlooked so frequently, and it bears so vitally 
upon the inferences to be deduced, that space may be taken here 
to illustrate its meaning further, with extracts from the admirable 
presentation by Deming and Birge in P:;^^:138-9. 

Let us recall, then, that in the above example nothing 
whatsoever is stated to be known with respect to the $\sigma$ of the
universe from which the sample of 10 readings came; the position
therefore is that “without some knowledge concerning $\sigma$ the
only thing we can do is to postulate that the sample was not
extraordinary” (loc. cit., 138), and proceed as already shown.
If we knew, however, from previous comparable observations,
that $\sigma$ may be supposed to be very close to .0040, for example,
it will be apparent that the reasonableness of $\sigma_s = .00355$ in
comparison with $\sigma = .0040$ can be examined if we have the law
of distribution of $\sigma_s$. This distribution is available in (e) at
p. 225; B; 13; and for samples of 10 from a normal universe
with $\sigma = .0040$, the average standard deviation can be placed at
.0040 × .9227 = .0037 from Table I in P:^^:128. This is
evidently so close to the observed $\sigma_s = .00355$ that the latter
can be accepted as being not unusual. With such knowledge
that $\sigma_s$ is not unusual, the “Student” test, as shown here in (1),
obviously could be applied with confidence. Under these
circumstances, however, the classical normal theory can also be
employed, for $\sigma$ is known closely; in this example, for instance,
the normal theory would use $\frac{\bar{x} - m}{\sigma/\sqrt{n}} = \frac{.002}{.004/\sqrt{10}} = 1.58$, and from
the chart in P:^^:134 it is seen immediately that the probability
here of drawing a sample of 10 with an absolute difference in the
mean as great as, or greater than, the postulated .002 is .114,
“which means that there is about 1 chance in 9 that $|\bar{x} - m| \geq .002$,
or that there is about 1 chance in 18 that $(\bar{x} - m) \geq +.002$”
(loc. cit., 139). This test and the “Student” test (as previously
applied in (1) here) “therefore concur, as they will when $\sigma_s$
is not extraordinary” (loc. cit.).

To see in another way how dependent the “Student” test is
on $\sigma_s$ being not unusual, suppose now that in the universe
$\sigma = .0025$, so that the observed $\sigma_s = .00355$ seems to be unusually
large (for, as shown in P:^P:139 from the distribution of $\sigma_s$
again, in samples of 10 with $\sigma = .0025$ we should expect to get $\sigma_s$
as large as or larger than .00355 only about 17 times in 1,000
trials). The “Student” test, with its ignorance of $\sigma$, still gives
$P_z = .13$, as in (1) here, and so does not suggest rejecting the
hypothesis that the population mean, $m$, is 1.0740. But with $\sigma$
now known to be .0025, the normal theory is available, and a
calculation similar to that in the preceding paragraph shows that
the probability of then drawing a sample of 10 with an absolute
difference in the mean as great as, or greater than, the postulated
.002 is only the much lower value .0114, which certainly would
suggest rejecting the hypothesis, in contradiction of the
“Student” inference. This disagreement “shows how misleading
the latter would be if used alone; the trouble comes, of course,
from the fact that $\sigma_s$ is now exceptionally high” (loc. cit.).
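
The two normal-theory probabilities quoted above, and the factor .9227 for the average sample standard deviation, can be checked as follows. This is a sketch only: the function names are illustrative, and the factor is computed from the standard gamma-function expression for the mean of a sample standard deviation (divisor $n$) in a normal universe, which is assumed here to be what the cited Table I tabulates.

```python
import math
from statistics import NormalDist

def mean_sd_factor(n):
    """Average sample standard deviation (divisor n) as a fraction of the universe sigma."""
    return math.sqrt(2.0 / n) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def normal_two_sided_probability(difference, sigma, n):
    """Probability that |sample mean - m| is at least `difference` when sigma is known."""
    u = abs(difference) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(u))

factor = mean_sd_factor(10)
p_sigma_0040 = normal_two_sided_probability(0.002, 0.0040, 10)
p_sigma_0025 = normal_two_sided_probability(0.002, 0.0025, 10)

print(round(factor, 4), round(0.0040 * factor, 4))
print(round(p_sigma_0040, 3), round(p_sigma_0025, 4))
```

With $\sigma = .0040$ the result agrees with the “Student” reading; with $\sigma = .0025$ it is roughly ten times smaller, which is the disagreement discussed in the text.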

(2) The same type of investigation can be made with 
respect to the heights, weights, etc., of individuals (see, for 
instance, P:i 77:440) when we know that the universe may be 
supposed to be normal, that randomness exists, and that $\sigma_s$ is
not unusual.

(3) Another illustration frequently quoted (e.g., in P:f 76^:583 
and P:45:127) was given by “Student” himself in the following
statement of additional hours of sleep induced in the same
patients by the use of two different drugs:

                         Drug       Drug        Difference
Patient                  No. 1      No. 2     (No. 2 - No. 1)

   1                     +0.7       +1.9          +1.2
   2                     -1.6       +0.8          +2.4
   3                     -0.2       +1.1          +1.3
   4                     -1.2       +0.1          +1.3
   5                     -0.1       -0.1           0.0
   6                     +3.4       +4.4          +1.0
   7                     +3.7       +5.5          +1.8
   8                     +0.8       +1.6          +0.8
   9                      0.0       +4.6          +4.6
  10                     +2.0       +3.4          +1.4

Mean ($\bar{x}$)         +0.75      +2.33         +1.58
Standard
Deviation ($\sigma_s$)    1.70       1.90          1.17

Here again let it be assumed that the necessary conditions
are satisfied regarding normality of the universe, randomness of
the samples, and the values of $\sigma_s$ being not unusual. Then:

(a) If it were desired to examine the probability that drug
No. 1 will cause an increase of sleep, it would be necessary to
take $\bar{x} - m$, being the deviation of the mean of the sample from
the specified mean of the universe, as $.75 - 0$; then

$z = \frac{.75}{1.70} = .44$; $t = z\sqrt{n-1} = .44(3) = 1.32$;

and from the tables the corresponding probability is .888.

(b) To obtain the probability that drug No. 2 is more
effective than No. 1 we similarly take $z = \frac{1.58}{1.17} = 1.35$, whence
$t = (1.35)(3) = 4.05$, and the probability is .999. Otherwise, from
the tables in Fisher’s form, it is seen that for $n - 1 = 9$ only 1
value in 100 will exceed 3.250 by chance (P:45:127 and 177).
On either reading, therefore, the probability from the tables of
the observed or a larger positive difference appearing by chance
alone is very small, so that the difference between the results of
the two drugs is undoubtedly significant.

With regard to the practical utility of these (and similar)
examples, it must of course be realized that a number of ad- 
ditional and comparable experiments would be necessary, in 
order to test the stability of the indications, before the inference 
suggested by the single experiment could properly be made the 
basis for any future action (here the future prescription of either 
drug). 
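
The values of $z$ and $t$ in example (3) follow directly from the tabulated hours of sleep. A short Python sketch (the function name is illustrative; the standard deviation uses the divisor $n$, which reproduces the tabulated 1.70 and 1.17):

```python
import math

drug_1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug_2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

def student_z_and_t(values, m=0.0):
    """Return mean, sigma_s (divisor n), z = (mean - m)/sigma_s and t = z*sqrt(n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sigma_s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    z = (mean - m) / sigma_s
    return mean, sigma_s, z, z * math.sqrt(n - 1)

# (a) Drug No. 1 alone, against a universe mean of zero.
mean_1, sigma_1, z_1, t_1 = student_z_and_t(drug_1)

# (b) Differences (No. 2 - No. 1) for the same ten patients.
differences = [b - a for a, b in zip(drug_1, drug_2)]
mean_d, sigma_d, z_d, t_d = student_z_and_t(differences)

print(round(mean_1, 2), round(sigma_1, 2), round(z_1, 2), round(t_1, 2))
print(round(mean_d, 2), round(sigma_d, 2), round(z_d, 2), round(t_d, 2))
```

The value of $t$ for the differences exceeds the 1 in 100 point of 3.250 quoted from Fisher's tables, which is the basis of the conclusion in (b).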

(4) Example 3(b) just given exhibits the effects of two
different actions (i.e., the administration of drugs No. 1 and
No. 2) upon the same $n$ (= 10) individuals, and, from the evidence
so tabulated, tests the significance of the difference (+1.58)
between the two means. In the same way we could take two
comparable sets of individuals, with $n$ (= 10, say) different persons
in each set, then tabulate the effects of drug No. 1 upon set No. 1
and of drug No. 2 upon set No. 2, calculate therefrom $\bar{x}$ and $\sigma_s$
for the differences so shown by the two samples (as in the last
column of the table in example (3) above), and proceed as in
example 3(b). Illustrations of this method may be found
conveniently in P:i 77:441 (example 23.2), and in the first calculation
at P:43:133.

(5) As pointed out in the text, however, the use of different
individuals in the two sets will obviously decrease the reliability
of the results obtained therefrom as in (4); some advantage may
consequently then be gained by using R. A. Fisher’s extension
of the “Student” principle for the case of two independent sets
(cf. P:45:132-3). Furthermore, if there is no correspondence
at all between the members of the two sets, Fisher’s extension
for the case of two independent sets will clearly be the only
applicable method.

As an illustration, take again the data of example (3),
and suppose that the results of drugs No. 1 and No. 2 were
secured from two sets of patients so distinct as to require the
assumption of complete independence. Then (as shown in
P:45:130, ex. 20) $\bar{x}_1 - \bar{x}_2 = +1.58$ as before; but now we calculate

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\dfrac{n_1\,{}_1\sigma_s^2 + n_2\,{}_2\sigma_s^2}{n_1 + n_2 - 2}}}$,

with $n_1 = 10$, $n_2 = 10$, ${}_1\sigma_s = 1.90$, and ${}_2\sigma_s = 1.70$; hence $t = \frac{1.58}{\sqrt{.721}} = +1.861$;
and entering the table of $t$ for $d = 10 + 10 - 2 = 18$ we find that
the probability lies between .1 and .05, which cannot be viewed
as significant. This conclusion, based on the supposition that
the sets of patients were quite distinct, affords an interesting
contrast with the conclusion of significance found by the method
of example 3(b) on the supposition that the patients were
identical. As Fisher points out (P:45:131), it provides a good
illustration of “the value of design in small scale experiments,
and that the efficacy of such design is capable of statistical
measurement”.

Another similar example may be found in P:45:133. In
P:iJf <9:153-4 two comparisons are also shown between the results
of the method here discussed and the classical normal theory.
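
The calculation for two independent sets in example (5) may be sketched as follows (Python; the function name is illustrative). The sketch uses the rounded standard deviations 1.90 and 1.70, so its value of $t$ differs slightly in the third decimal from the 1.861 of the text, which rests on unrounded figures. The two tabular points used for comparison are the usual 10% and 5% two-sided values of $t$ for 18 degrees of freedom and are not quoted in the text itself.

```python
import math

def pooled_t(mean_1, sigma_1, n_1, mean_2, sigma_2, n_2):
    """t for two independent samples, with sigma_1 and sigma_2 computed with divisor n."""
    pooled_variance = (n_1 * sigma_1**2 + n_2 * sigma_2**2) / (n_1 + n_2 - 2)
    standard_error = math.sqrt((1 / n_1 + 1 / n_2) * pooled_variance)
    return (mean_1 - mean_2) / standard_error, n_1 + n_2 - 2

# Sample 1 is drug No. 2 and sample 2 is drug No. 1, so that the difference is +1.58.
t, d = pooled_t(2.33, 1.90, 10, 0.75, 1.70, 10)

# Usual tabular points of t for 18 degrees of freedom (two-sided 10% and 5%).
T_10_PERCENT, T_5_PERCENT = 1.734, 2.101
print(round(t, 3), d, T_10_PERCENT < t < T_5_PERCENT)
```

The same mean difference of 1.58 that gave $t$ of about 4.05 on the paired basis thus gives a value below the 5% point when the two sets are treated as independent.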


(6) A simple illustration of the use of Fisher’s z-distribution
(47s) in testing the difference between two sample variances,
${}_1\sigma_s^2$ and ${}_2\sigma_s^2$, is given by Rider in P:li;?:118, where ${}_1\sigma_s^2 = 9.6$
for a sample of $n_1 = 7$ variates is to be compared with ${}_2\sigma_s^2 = 4.8$
for $n_2 = 9$. To calculate

$z = \tfrac{1}{2}\log_e\left[\dfrac{\frac{n_1}{n_1 - 1}\,{}_1\sigma_s^2}{\frac{n_2}{n_2 - 1}\,{}_2\sigma_s^2}\right]$

we remember that the numerator is $\frac{7}{6}(9.6) = 11.2$, and that similarly
the denominator is $\frac{9}{8}(4.8) = 5.4$, so that $z = \tfrac{1}{2}\log_e\left(\frac{11.2}{5.4}\right) = .3648$.
From Fisher’s
tables (P:45:251) we see that, since the degrees of freedom are
$d_1 = 6$ and $d_2 = 8$, the “5% point” is .6378; the value .3648 for $z$
is thus well inside that point; and the inference to be drawn
therefore is that, on that test (and subject to the reservation
concerning ${}_1\sigma_s^2$ and ${}_2\sigma_s^2$ being not unusual), ${}_1\sigma_s^2$ would not be
regarded as significantly greater than ${}_2\sigma_s^2$.
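
The value of $z$ in example (6) can be verified in a few lines. In this Python sketch the function name is illustrative, and the 5% point is simply the figure .6378 quoted in the text from Fisher's tables; it is not computed.

```python
import math

def fisher_z(variance_1, n_1, variance_2, n_2):
    """z = (1/2) log_e of the ratio of the two variance estimates n*sigma_s^2/(n - 1)."""
    estimate_1 = n_1 * variance_1 / (n_1 - 1)
    estimate_2 = n_2 * variance_2 / (n_2 - 1)
    return 0.5 * math.log(estimate_1 / estimate_2), n_1 - 1, n_2 - 1

z, d_1, d_2 = fisher_z(9.6, 7, 4.8, 9)

FIVE_PERCENT_POINT = 0.6378  # quoted in the text for d_1 = 6 and d_2 = 8
print(round(z, 4), d_1, d_2, z < FIVE_PERCENT_POINT)
```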


C; 14. The Applicability of the Poisson Exponential 

The striking ability of the Poisson function $y_r = \frac{e^{-m} m^r}{r!}$ to
depict markedly skew point binomials will be seen from the
following examples (given in P:S^:267, in PiJfJf 7:299, and
P:5^:267 respectively), where the symmetrical Normal Curve
$y = \frac{1}{c\sqrt{\pi}}\, e^{-x^2/c^2}$ would clearly give a very poor representation.


$10,000(.001 + .999)^{100}$, i.e., $q = .001$ and $nq = .1$

Number of          Point          Poisson
Occurrences       Binomial      Exponential

     0              9048           9048
     1               906            905
     2                45             45
     3                 1              2
     4                 0              0


$10,000(.008 + .992)^{250}$, i.e., $q = .008$ and $nq = 2$

Number of          Point          Poisson
Occurrences       Binomial      Exponential

     0              1343           1353
     1              2707           2707
     2              2718           2707
     3              1812           1805
     4               902            902
     5               358            361
     6               118            120
     7                33             34
     8                 8              9
     9                 2              2
    10                 0              0


$10,000(.05 + .95)^{100}$, i.e., $q = .05$ and $nq = 5$

Number of          Point          Poisson
Occurrences       Binomial      Exponential

     0                59             67
     1               312            337
     2               812            842
     3              1396           1404
     4              1781           1755
     5              1800           1755
     6              1500           1462
     7              1060           1044
     8               649            653
     9               349            363
    10               167            181
    11                72             82
    12                28             34
    13                 3             13
    14                 1              5

It will interest actuaries to realize that the importance in
practical statistics of the Poisson function $y_r = \frac{e^{-m} m^r}{r!}$ for the
probability of r occurrences of a rare event in n trials, where m 
is computed as the mean, was first illustrated by Bortkiewicz 
from the deaths in the Prussian army from an unusual cause, 
namely, the kicks of horses (see p. 166; A; 11). The data 
covered 10 army corps for 20 years, i.e., 200 observations, and 
from 122 deaths gave .61 as m, from which (by tables of the 
Poisson function — see p. 234; B; 15) the comparison between 
the observed and calculated values was remarkably close, as 
follows: 


Number of Deaths     Observed Frequency     Theoretical Poisson
   per Annum           of Occurrence             Frequency

       0                    109                   108.67
       1                     65                    66.29
       2                     22                    20.22
       3                      3                     4.11
       4                      1                      .63
       5                      0                      .08
       6                      0                      .01
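
The theoretical column can be regenerated from the observed frequencies alone, since $m$ is simply the mean number of deaths per observation. A short Python sketch (variable names are illustrative):

```python
import math

# Observed number of corps-years showing 0, 1, 2, ... deaths from horse kicks.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1, 5: 0, 6: 0}

observations = sum(observed.values())                   # 200 corps-years
deaths = sum(r * f for r, f in observed.items())        # 122 deaths
m = deaths / observations                               # mean, .61

expected = {r: observations * math.exp(-m) * m**r / math.factorial(r)
            for r in observed}

for r in observed:
    print(r, observed[r], round(expected[r], 2))
```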

The preceding illustrations demonstrate the closeness of the
approximation to the ordinates themselves in certain types of
cases. In considering, however, whether the Poisson function is
to be preferred to the Normal Curve as a means of examining the
probability of deviations within a certain range or exceeding a
given amount, it will be well here to examine further some
examples first given by H. L. Rietz (P:Jff 7:299 and 1^:115).
For the point binomial $(.008 + .992)^{250}$ already noted,
which might represent a group of 250 persons of the same age
for each of whom the rate of mortality is .008, we have $q = .008$,
$nq = 2$, and $\sigma = \sqrt{npq} = 1.4085$.

If the group of persons were annuitants, a number of deaths 
smaller than the expected 2 would be unfavourable. Using the 
binomial and Poisson values previously calculated, and from 
tables of the “probability integral” (see p. 161; A; 5), the
probabilities of deviations not exceeding 2 in defect (i.e., of only
1 or 0 deaths instead of 2) are

For the point binomial             .4050
By the Poisson exponential         .4060
By the normal curve (area)         .3233

The result by the normal curve is therefore about 20% too low.

On the other hand, if an insurance experience were under
consideration, for which an excess of deaths would constitute
an unfavourable event, the probabilities of deviations not
exceeding 2 in excess are

For the point binomial             .2714
By the Poisson exponential         .2707
By the normal curve (area)         .3233

Here the normal curve gives a result about 19% too high.

When, however, we examine the probabilities in relation to
the criterion suggested for the normal curve that a deviation
of $3\sigma$ or more is very improbable, the skewness of the distribution
again has a marked effect. Since $3\sigma = 4.2$, and a deviation of
$+3\sigma$ would therefore imply 6.2 deaths, we may examine the
probabilities of 7 or more deaths actually occurring, thus:

For the point binomial             .0043
By the Poisson exponential         .0045
By the normal curve (area)         .0007

In this case the normal value is about 84% too low, a serious
discrepancy on the wrong side.

If, therefore, such results — for small q (or p) and nq (or np ) — 
were to be used as a basis for determining the risk of unfavourable 
deviations from the expected mortality, it would be essential to 
apply the normal theory with great care, and with particular 
reference to the degree of skewness and to the question whether 
positive or negative deviations of given extent would constitute, 
in practice, an unfavourable outlook (cf. PrH 7 *.299-303). 

Although it thus becomes clear that the Poisson exponential 
may well, in certain cases, give much more reliable conclusions 
than the normal theory — especially when the investigations only 
involve either positive or negative deviations — it must be noted 
that the advantages of Poisson’s formula are not usually so 
marked when both positive and negative deviations are examined 
together, as is often sufficient in the practical consideration of 
sampling errors. 

As an example, in the somewhat extreme case just employed, 
the probabilities of deviations not exceeding ±2, taken together,
would be

For the point binomial             .6764
By the Poisson exponential         .6767
By the normal curve (area)         .6466


Here the normal curve gives an indication only about 4% too 
low, in comparison with 20% and 19% respectively for the 
negative and positive deviations. 
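
The three sets of probabilities compared above can be recomputed directly. In the Python sketch below the names are illustrative, and the limits used for the normal areas (deviations from .5 to 2.5 on either side, and beyond 4.5 for 7 or more deaths) are an assumption of the sketch; they are not stated in the text, but they reproduce its figures of .3233 and .0007.

```python
import math
from statistics import NormalDist

n, q = 250, 0.008
m = n * q                               # expected deaths, 2
sigma = math.sqrt(n * q * (1 - q))      # 1.4085
normal = NormalDist(0.0, sigma)         # distribution of deviations from the mean

def binomial(r):
    return math.comb(n, r) * q**r * (1 - q) ** (n - r)

def poisson(r):
    return math.exp(-m) * m**r / math.factorial(r)

def normal_area(low, high):
    """Normal area between two deviations from the expected number of deaths."""
    return normal.cdf(high) - normal.cdf(low)

# 0 or 1 deaths (defect), 3 or 4 deaths (excess), and 7 or more deaths.
defect = (binomial(0) + binomial(1), poisson(0) + poisson(1), normal_area(-2.5, -0.5))
excess = (binomial(3) + binomial(4), poisson(3) + poisson(4), normal_area(0.5, 2.5))
seven_or_more = (1 - sum(binomial(r) for r in range(7)),
                 1 - sum(poisson(r) for r in range(7)),
                 1 - normal.cdf(4.5))

for label, row in [("defect", defect), ("excess", excess), ("7 or more", seven_or_more)]:
    print(label, [round(p, 4) for p in row])
```

Adding the defect and excess rows gives the combined figures for deviations not exceeding ±2, where the normal curve is seen to err far less than on either side alone.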

Another similar illustration is shown by Rietz in P:775, 
which again indicates that the normal curve gives a reasonable 
representation when positive and negative deviations are con- 
sidered together, even though the approximation is not good 
when one side (either positive or negative) is taken alone. 

It will accordingly be realized from these examples that the
applicability of the “normal” theory should be tested in any
practical case where q (or p) is small but n large enough that nq
(or np) is about 10 or less (cf. the diagram on p. 267; C; 4).


C; 15. The Practical Applicability of Edgeworth’s Generalized
Law of Error

The ability of Edgeworth's curves to represent distributions 
which are not very skew has been amply demonstrated by 
examples which may be found in his original papers, in H:f07: 
329-334 (where in (52) c = 1.683 and j = .0728), in H:57:39 
(where c = 3.623 and j = .06), and in P:5;^:134-7. They will
generally be applicable to the same types of data which can be 
graduated by the Gram-Charlier Type A series. Under con- 
ditions of marked skewness, however, Pearson's system or the 
Poisson-Charlier Type B curve may be expected to give more 
satisfactory graduations. 

It should be noted that theoretically 7 = — , so that the fitting 

might be performed by the method of moments (see Chapter 
VIII, and H:Jf 07:330-4 for a numerical example). In practice, 
however, that method does not always give good results (see 
H:57:40). Edgeworth consequently devoted much effort to
the development of alternatives, particularly (i) a “method of
percentiles”, (ii) a process based on the condition that the values
of the constants should minimize the improbability of the
observations arising at random from a curve of the given form,
and (iii) his “method of translation”, of which an account is
given in H:7^;^:55-78 and 85.


C; 16. Applications of the Gram-Charlier Type A and Poisson-
Charlier Type B Series

An example of using successively one, two, and three terms
of Type A is given in P:f74:117-8. Several graduations by the
Type A and the Type B series are discussed by Elderton in
P:5^:134-140, together with valuable comparisons of the results
which can be obtained alternatively by Edgeworth's or Pearson's
systems. A very detailed examination of the practical utility
of both Types A and B, and the technique of fitting, is available
in P:5^:215 et seq.

Although Type A certainly can be made to assume a very
skew shape — as is shown at P:S^:226-232 for a distribution of 
six values only — it would appear to be generally conceded that 
it is not of great practical service in such cases, and that then 
either the method of logarithmic transformation with Type A, 
or the adoption of Type B, is likely to be preferable (see es- 
pecially P:5^:235-260, and P:S;^:131-140). 

An objection which has been levelled against Type A is 
that the series expansion sometimes gives rise to negative fre- 
quencies in dealing with fairly skew distributions (P:S;?:140, 
and P:i^0:48). The advocates of the Gram-Poisson-Charlier 
school, however, reply that in reality this is a matter of little 
practical importance, because in any event the observations at 
the ends of the distribution are very small (P:5&:217). 


C; 17. The Practical Use of Pearson’s Frequency Curves in 
Actuarial Work 

So far at least as readers of English are concerned, the 
Pearsonian system of frequency curves far overshadows Edge- 
worth’s and the Scandinavian methods as an essentially practical 
means of representing statistical frequency distributions which 
occur in practice. Not unnaturally, controversy has at times 
surrounded the extensive applications which have been made by 
Pearson and his followers, and some supporters of the Scandi- 
navian school have advanced the claim that the Gram-Charlier 
or Poisson-Charlier series will deal more adequately with difficult 
cases (cf. P:5^:184-5, 216, and 232-4). The comparatively 
simple manner, however, in which the appropriate Pearson curve 
can first be selected by the use of the “criterion” (68), and then
fitted by the “method of moments” (see p. 97 here), has led to
the wide acceptance of the system in many fields, and in actuarial 
work has stimulated several interesting developments which are 
of particular importance to the actuarial student. 

The Direct Representation of Actuarial Data by Pearson's Curves 

The Type I curve is useful as a means of representing a wide 
variety of actuarial distributions. Being related to the skew
point binomial (and so being, in fact, capable of expressing the
terms of the point binomial very closely even when n is as small
as 6; see P:5i:46), the curve in its skew bell-shaped form often
resembles the numbers exposed to risk in a mortality or similar
experience (see H: 7^ -.Diagram 1, P:5i:47 and 55-57, and
P:S^:62). In its bell-shaped form it may also be employed to
describe the actual deaths (see H:7P: Diagram 1, and H:^^)
or the “entrants” (see -.diagram, and P:5jf:6 and 47) in a
mortality experience, and such diverse functions as the number 
of marriages and the rates of marriage, or disability retirements, 
the average number of children, and the cost of their pensions, 
in a pension fund (P:S1 :47), or the distributions of sums assured 
or premiums by age groups (H:f<5i:35). It was also used 
extensively in ll:122:Z22 to represent the age distributions of 
the populations of India. A numerical example of the J-shaped 
form may be found in P:5^:125-6, while the U-shaped curve is 
noted in P.-5;^:112, and the twisted J-shape in P:S^:111. 

The second main type, No. IV, is not particularly useful in
actuarial work, since it is unlimited in both directions, and the
diminishing rate at which the values decrease at the ends is not
encountered often except in dealing with a function like the “rate
of withdrawal”. Numerical examples are shown in P:5^:68 and
137, and P:;^0:7-11, and its applicability for representing the
distributions of sums assured or premiums is indicated in H:181:
35. The work involved in calculating the ordinates, which
previously had always made Type IV the most troublesome of
all the Pearson curves to fit, has now been greatly facilitated by
the publication of the tables noted at P:20.

The skew bell-shaped form of No. VI, the third main type,
has been illustrated in P:5^:77 as a representation of the number
of “entrants” in a mortality experience, and is suggested in
H:181:35 for sums assured or premiums by age. In general,
however, its utility is reduced by the fact that its range is limited
in one direction only (cf. P:5f :49).

Coming now to the Transition Types, the Normal Curve has
been used very widely, of course, in much of the basic theory
with which this study deals. Its utility (despite its theoretically
unlimited range) as an approximation to the terms of the sym- 
metrical point binomial is illustrated numerically in P:5i:44, 
and its fitting to certain actuarial data (sums assured with 
bonuses, and reserves, grouped by office years of birth) by the 
method of moments is shown in P:S;?:81. It may be employed 
also to represent approximately a hypothetical (though not the 
actual) exposed to risk in the special method of determining the 
Makeham constant c which is described on p. 318 here. Another 
ingenious method is illustrated by Sir G. F. Hardy in P:5i:91-98 
(see also H:Jf;^;^:389), which uses the Normal Curve in the 
representation of an asymmetrical series, consisting of only a
few groups, by taking the proportionate numbers of exposed to
risk, deaths, living, etc., above age t as $\frac{1}{\sqrt{\pi}}\int_{-\infty}^{z} e^{-t^2}\,dt$, and then
adjusting the values of z on the assumption of constant third or
fifth differences.

The transition Type II in its bell-shaped form, and with its 
advantage (unlike the Normal Curve) of being limited in both 
directions, is also closely related to the symmetrical point 
binomial and to the Normal Curve, as may be seen from the 
examples in P:5^:87-89 and 134, and from P:Jf :42-43 and 44.
Being of limited range, its values naturally decrease at the ends 
more rapidly than those of the Normal Curve. Its utility for 
representing the distributions of sums assured or premiums is 
shown in :35. It should be noted that it assumes a sym- 

metrical U-shape when m is negative. 

The transition Type VII, being again symmetrical, is of rather 
limited application to actuarial data, though again it is suggested 
for the distributions of sums assured or premiums in :35. 

The curve differs from Type II in that its values decrease at the 
ends more slowly than the Normal Curve. 

The utility of the skew transition Type III is often limited 
by its tendency to run into a geometrical progression. An 
example of its application is given in P;S;?:93. It may be useful,
however, for the hypothetical representation of the exposed to
risk in the indirect method of graduating rates of mortality, 
marriage, etc., which is discussed below. 

The skew transition Type V also has a restricted applicability 
to actuarial statistics for the same reason as that noted for 
Type IV, namely, the diminishing rate at which the values 
decrease at the ends. 

The remaining Types VIII to XII are not bell-shaped, and 
might sometimes be useful, as indicated by Elderton in ¥\S2\ 
104-112, if it should be desired to represent such data as maturi- 
ties of endowment assurances (Type VIII), the exposed to risk 
by duration (Type IX), withdrawals (Type XI), or withdrawals 
in select tables (Type XII). 

The Compounding of Frequency Curves for Actuarial Data 

It may be of interest to note that some cases arise in practice 
which obviously cannot be dealt with by any single frequency 
curve, but evidently might be represented by the addition of 
several such curves. The actual deaths in certain experiences, 
or the values of $d_x$ in the hypothetical community of a “life
table”, for example, when tabulated for all ages from infancy to
old age, might begin at a high figure (say 100,000 for ages 0-5), 
decrease sharply during the early years (to about 6,000 for ages 
10-15), then increase (steadily, or with a point of inflexion at 
about age 25 or 30) to a maximum (of perhaps 70,000 for ages 
73-78), and finally decline gradually to zero at about age 100. 
In such a distribution there would be at least two prominent 
maxima (in infancy and at ages 73-78), with perhaps a third 
slightly marked at the point of inflexion about age 25 or 30. A 
J-shaped curve therefore might be used for the sharply descending
portion in the earliest years, with a skew bell-shaped type
thereafter, and perhaps a third small bell-shaped curve (sym- 
metrical or skew) to provide for the point of inflexion. This 
method of “compounding” frequency curves is illustrated, for 
the case of the English Life Table No. 4, by Karl Pearson's 
analysis of the life-table function $l_x\mu_x$ (i.e., the numbers dying
per annum at the moment of attaining age x) into five
superimposed frequency curves representing old age, middle life,
youth, childhood, and infancy respectively (see H:74, H:77, 
and V:102A%). 

Another example of compounding (though of a rather differ- 
ent kind) is to be found in Howell’s addition of two Normal 
Curves to a Makeham graduation in H:/5^:198. 

[The splitting of the distribution of actual deaths into a series 
of frequency curves has been advocated by Arne Fisher (H:^S0, 
and ¥1:137) as part of a process for determining, from propor- 
tionate death ratios by causes of death, the rates of mortality of 
an experience without any information concerning the “exposed
to risk” from which the deaths arose. A summary of the method
is stated in P:167:86 (footnote), where it is pointed out, however, 
that the procedure is necessarily unsafe.] 

The Indirect Graduation of Rates of Mortality, Sickness, Marriage, 
Withdrawal, etc., by Pearson's Frequency Curves

Rates of mortality by age, from infancy to the limit of life, 
usually take the form of a contorted U-shaped curve (see Figure 
19, p. 82), and cannot ordinarily be represented by a single 
frequency curve. Nor can sickness rates be so depicted. Rates 
of marriage according to age, on the other hand, and of with- 
drawal by age or duration, do assume a form much like the skew 
bell-shape of Type III. The fit resulting from a direct frequency- 
curve graduation of such rates is generally poor, however, 
mainly because weights are given to the values at the end of life
equal to those assigned at the other ages where the “exposed to
risk” are much more numerous, with the result that those end
values exercise a disproportionate influence on the results (see, 
for example, columns (4) and (5) of P:S2:\n). 

It is consequently advisable to employ a method which will 
avoid this difficulty by giving due weight to the varying magni- 
tudes of the exposed to risk upon which the rates are based. 
Since in many cases both the exposed to risk and the deaths (or 
marriages, etc.) assume after the earliest ages the shape of a 
frequency curve like Type I, an obvious procedure would be to 
fit a Type I curve separately to each, and thence to take the
ratios of the graduated deaths to the graduated exposed as the 
finally graduated rates of mortality. An example may be found 
in a paper by Elderton, H:5S (see also 11:181 A-S). 

Such a process, however, suffers from the defect that the two 
separate graduations fail to make allowance for the fact that the 
fluctuations in the observed exposed to risk and deaths are not 
independent. The method can therefore be improved, as seems
to have been suggested first by Calderon (H:7^:164 et seq., and
11:89:521), by “adopting a formula for the deaths whereby
the rate of mortality would be incorporated into the expression
for the death curve, and connecting it with the exposed curve,
so that the two were graduated together rather than separately”.
On this principle, accordingly, we should first represent the 
exposed to risk by a suitable frequency curve; then we compute 
the deaths (so that they will correspond with the frequency-curve 
representation of the exposed) by multiplying the graduated 
exposed by the original rates of mortality; next graduate these 
recomputed deaths by a frequency curve; and lastly take the 
ratios of these graduated deaths to the graduated exposed as 
the finally graduated rates of mortality. Since this representa- 
tion of the exposed by a frequency curve is intended to provide, 
in effect, only a series of approximate weights, it will be sufficient 
to adopt merely a hypothetical (rather than a fitted) curve for 
that purpose, such as even a Normal Curve for which the or- 
dinates are readily available. 
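
The four steps just described may be sketched as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the hypothetical exposed are taken as Normal Curve ordinates, and (purely to keep the sketch short) the recomputed deaths are also graduated by a Normal Curve fitted by moments, whereas the examples cited in the text use Pearson's Type I or Type III at that step.

```python
import math

def normal_ordinates(ages, mean, sd, total=1.0):
    """Ordinates of a Normal Curve with the given total, mean and standard deviation."""
    k = total / (sd * math.sqrt(2.0 * math.pi))
    return [k * math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in ages]

def first_moments(ages, freq):
    """Total, mean and standard deviation of a frequency distribution."""
    total = sum(freq)
    mean = sum(x * f for x, f in zip(ages, freq)) / total
    var = sum((x - mean) ** 2 * f for x, f in zip(ages, freq)) / total
    return total, mean, math.sqrt(var)

def indirect_graduation(ages, observed_rates, exposed_mean, exposed_sd):
    """Graduate rates indirectly through a hypothetical exposed curve."""
    # Step 1: hypothetical exposed to risk (approximate weights only).
    exposed = normal_ordinates(ages, exposed_mean, exposed_sd)
    # Step 2: recomputed deaths = hypothetical exposed x observed rates.
    recomputed = [e * q for e, q in zip(exposed, observed_rates)]
    # Step 3: graduate the recomputed deaths by a frequency curve
    # (assumption: a Normal Curve by moments, for illustration only).
    total, mean, sd = first_moments(ages, recomputed)
    graduated_deaths = normal_ordinates(ages, mean, sd, total)
    # Step 4: graduated rates = graduated deaths / hypothetical exposed.
    return [d / e for d, e in zip(graduated_deaths, exposed)]
```

With a Normal Curve used at both steps the graduated rates are necessarily of the form exp(quadratic in x); a skew curve at step 3 removes that restriction.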

The first published experiments with this simplified technique 
were given by Elderton in H :83, and two examples are shown in 
P:5^:116 and 118 for the graduation of marriage rates and 
death rates — the recomputed marriages being graduated by 
Type III, and the recomputed deaths by Type I. 

Illustrations are also available in 11:181 of using colog px or
npx instead of the rate of mortality qx, and more elaborate
discussions of the underlying theories may be found in li:14S and
11:181. In H:iS7:234 an illustration is also shown in which the 
exposed are represented by the Normal Curve and the recom- 
puted deaths are graduated by the Gram-Charlier Type A series 
instead of by one of Pearson’s curves. In H:181:33 the principle
here under discussion is again applied with the Normal Curve 
for the exposed, but with the survivors (instead of the recom- 
puted deaths), obtained by multiplying the hypothetical exposed 
by \Qpx (instead of qx), being graduated alternatively by the
Gram-Charlier Type A series and Pearson’s Type I curve for 
purposes of comparison. 

The Determination of Makeham's Constant c by means of Fre- 
quency Curves 

The principle just described is of special value in the calculation
of the constant c in Makeham’s formula for the force of
mortality, namely, $\mu_x = A+Bc^x = A+Be^{\lambda x}$, where $\lambda = \log_e c$.

Calderon was the first (H:7^:164 and 192) to publish the
suggestion, as the result of work with G. F. Hardy on the
graduations of the British Offices’ Annuitants’ Experience, 1863-93
(loc. cit., 169 and 191). His original idea in that paper was to
use a point binomial for the exposed to risk, which should be
arranged in not more than five or six broad groups (of at least
10 ages in each group) in order to give a reasonable fit (see P:5/ :
67 and 134). The expression for the “recomputed deaths”,
being then a point binomial $(\alpha+\beta)^n$ multiplied by the force of
mortality, evidently assumes the form $(\alpha+\beta)^n(A+Bc^x)$; the
moment relations are $\theta_0 = AE_0+BE'_0$ for the total, and similarly
for the first and second moments, where $\theta$ denotes the moments
of the recomputed deaths, $E$ those for the exposed, and $E'$ those
of the exposed multiplied by $c^x$. From the resulting expressions
the value of c may be deduced easily, as shown in H:7P:164-5
and P:55:88-90.
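
As a numerical sketch of how c follows from the three moment relations: eliminating A and B leaves a single condition on c, namely that the determinant of the three moment columns vanishes. The code below is an illustration under assumptions, not the procedure of the references cited. The function names are hypothetical, any positive series may be supplied as the hypothetical exposed, and a bisection search between two trial values of c stands in for the hand calculation. Since c = 1 always satisfies the condition trivially, the trial values should both exceed 1.

```python
def moment_sums(ages, values, origin):
    """Sums of u^k * value for k = 0, 1, 2, with u = age - origin."""
    sums = [0.0, 0.0, 0.0]
    for x, v in zip(ages, values):
        u = x - origin
        sums[0] += v
        sums[1] += u * v
        sums[2] += u * u * v
    return sums

def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def moment_columns(ages, exposed, deaths, c, origin):
    """Moments of the deaths, the exposed, and the exposed multiplied by c^x."""
    theta = moment_sums(ages, deaths, origin)
    e = moment_sums(ages, exposed, origin)
    e_dash = moment_sums(ages, [ex * c ** x for x, ex in zip(ages, exposed)], origin)
    return theta, e, e_dash

def consistency(ages, exposed, deaths, c, origin):
    """Zero when theta_k = A*E_k + B*E'_k can hold for k = 0, 1, 2 together."""
    theta, e, e_dash = moment_columns(ages, exposed, deaths, c, origin)
    return det3([[theta[k], e[k], e_dash[k]] for k in range(3)])

def solve_makeham_by_moments(ages, exposed, deaths, c_low, c_high, steps=80):
    """Return (c, A, B) from the three moment relations by bisection on c."""
    origin = sum(ages) / len(ages)  # central origin, for numerical stability only
    f_low = consistency(ages, exposed, deaths, c_low, origin)
    f_high = consistency(ages, exposed, deaths, c_high, origin)
    if f_low * f_high > 0.0:
        raise ValueError("the trial values of c do not bracket a solution")
    for _ in range(steps):
        c_mid = 0.5 * (c_low + c_high)
        f_mid = consistency(ages, exposed, deaths, c_mid, origin)
        if f_low * f_mid <= 0.0:
            c_high = c_mid
        else:
            c_low, f_low = c_mid, f_mid
    c = 0.5 * (c_low + c_high)
    # A and B from the total and the first-moment relations.
    theta, e, e_dash = moment_columns(ages, exposed, deaths, c, origin)
    den = e[0] * e_dash[1] - e[1] * e_dash[0]
    a = (theta[0] * e_dash[1] - theta[1] * e_dash[0]) / den
    b = (e[0] * theta[1] - e[1] * theta[0]) / den
    return c, a, b
```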

A development of Calderon’s point binomial method was 
next employed by Elderton in H:55 (see also H:f5jf:4), and is 
now to be found conveniently at P:5j^:120-3. The improve- 
ment uses the Normal Curve as a hypothetical representation of 
the exposed, and is easily applied since the ordinates of that 
curve are readily available in tables such as F:97. The neces- 
sary formulae are demonstrated clearly in P:5;^:120. The 
excellent results of a numerical application to theO^^^®^ mortality 
table are shown in F:S2:122, and another illustration is given
in P:S5:256. 

Pearson’s Type III curve has likewise been suggested by 
Hardy for the hypothetical exposed, since in combination with
Makeham’s formula it also leads to a simple expression from which
$\lambda$ ($=\log_e c$) can be found. The formulae are proved in P:dl :134,
and are stated again in P:5^:87-88. The ordinates of the 
Type III curve are now available (cf. P:55:256, footnote) in 
P:^;^0:63. 

[It may further be noted here that Hardy has given, in 
P:5i:135, the necessary formulae for this method when the 
exposed are represented by the form y =^x“(l where 
X represents a proportionate part of the range of the curve so 
that it varies between 0 and 1, as in (75) here.] 

C; 18. The Principle of “Uniform Seniority”

The particular advantage of Makeham’s formula (83) for
actuarial purposes lies in its possession of the valuable property
of “uniform seniority”, by which an annuity on n joint lives of
any ages may be computed by the substitution of the same
number of lives, all of the same age, in accordance with the relation

$$a_{xyz\ldots} = a_{www\ldots}, \quad \text{where} \quad c^w = \frac{1}{n}(c^x+c^y+\ldots).$$
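
The equivalent equal age w defined by $c^w = \frac{1}{n}(c^x+c^y+\ldots)$ is easily evaluated. The following is a minimal sketch; the function name is hypothetical, and the value c = 1.1 used in the example is an arbitrary illustration rather than a constant taken from any table.

```python
import math

def equivalent_equal_age(ages, c):
    """Age w such that c**w equals the arithmetic mean of c**x over the given ages."""
    mean_power = sum(c ** x for x in ages) / len(ages)
    return math.log(mean_power) / math.log(c)

# Three joint lives aged 30, 40 and 50 are replaced by three lives all aged w.
w = equivalent_equal_age([30, 40, 50], 1.1)
```

Because $c^x$ is convex in x, w always lies above the arithmetic mean of the ages (and below the oldest age), which is why the adjustment may be stated either as an addition to the youngest age or as a deduction from the oldest.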

Makeham’s second expression, (84), permits the use of a 
modified and less convenient uniform seniority method, involving 
a special rate of interest for the substituted annuity at equal 
ages (see 11:187:535). 

Hardy pointed out that formula (95) also preserves the 
principle in a modified form, and in F:85:4:13 and P:^.^:542 
its application in the case of the modified double and triple 
geometric expressions (96) and (97) is examined. 

The name “uniform seniority” arises from the fact that the 
calculation of the equivalent equal ages, w, is made, in effect, 
by an addition to the youngest age. It has been observed, 
however (P:^5:413 and P:P4’547), that its use may be facilitated 
if it is applied as a method of “uniform juniority”, i.e., deduction 
from the oldest rather than addition to the youngest age. 


C; 19. The Application and Fitting of the Verhulst-Pearl-Reed
(the “Logistic”) Curve of Population Growth

Very complete statistical illustrations of the “logistic” curve
have been worked out by several investigators — sometimes in 
order to examine its acceptability as a “law” of population 
growth, on other occasions to illustrate its dangers as an instru- 
ment of prediction, and again to investigate the difficulties of 
fitting such a transcendental form. 

Pearl and Reed have been in the forefront amongst those who 
would claim some measure of universality for the method. 
Summarizing in P:^^:637, for example, the results of a number 
of their papers, Pearl shows the logistic (sometimes in its original 
symmetrical form, and sometimes as the sum of two such curves 
or in the generalized form (103) to give effect to cycles of growth) 
for 16 different countries, and concludes that it “does in fact 
describe the known (or, in the case of the world, estimated) 
population growth with great precision and fidelity”, so that 
“this evidence makes it probable that the curve is at least a first 
approximation to a descriptive law of population growth”. It 
is of interest, also, to note that Pearl’s rediscovery of this form 
of curve was supported by observations of the growing numbers 
of the fruit-fly Drosophila melanogaster enclosed for breeding 
and observation in the “limited universe” of a milk-bottle, 
which followed the curve quite closely — a type of evidence 
which prompted Sir Athelstane Baines to observe (P:176:6\)
that “so far as the habits of the Drosophila are concerned, I must 
confess that I sympathize with Dickens’ Eugene Wrayburn [the 
calm but briefless barrister in ‘Our Mutual Friend’] who, when 
taunted with the example of the ant and the bee, ‘protested, on 
principle, as a biped’ ”. 

Other valuable statistical examinations are shown in P:176: 
7-23 by Yule, and in P:105 by Reed and Pearl. In 
Schultz has given further illustrations, and has emphasized the 
importance of the standard errors of the forecast values (see also 
P:^0:311). In connection with the use of the logistic for long- 
range prediction, his conclusion, indeed, may be quoted: “There 
is no necessary relation between the goodness of fit of a curve to
past observations and its reliability for forecasting purposes. A 
curve may fit the data for the past 100 years with a high degree 
of accuracy, and yet fail to predict the situation for the next year 
or so”.

Another extensive series of examples for 29 populations is 
provided by Wilson and Puffer in P:159, which draws attention
to cases in which the logistic gives eventually an infinite popula- 
tion. 

An excellent contribution, also, is the paper (P:23) by 
Cramér and Wold. In addition to a very useful examination
of the characteristics and methods of fitting the logistic, their
general conclusion (loc. cit., 203) should be noted: “The method
gives good results in cases when the data are not too few in
number and show an evident trend of the logistic type. In other
cases, the advantage of fitting the logistic curve to statistical
data seems to us somewhat doubtful”.

Methods of Fitting the Logistic 

The difficulties of fitting the curve have been investigated 
exhaustively. 

(1) Verhulst, and Pearl and Reed, have used simply 3, or 5, 
equidistant ordinates (see V\176A^\ P:P^:576; and P:;07:242). 

(2) Yule has suggested summing the reciprocals in 3 succes- 
sive groups, and thus equating the harmonic means of the actual 
and calculated populations in each group (P:i7^:7 and 51). 

(3) The same author also investigated the fitting of the form 
(102b) in B; 20 by finding L and a from the fitting of a straight 
line to the proportionate increases over successive intervals of 
time by the method of least squares, and then determining $\beta$
from sums of the reciprocals of the given populations (P:l 7^:52; 
see also particularly P:^0:293). 

(4) Will has illustrated a method based on finite differences 
(see p. 172; A; 17) in H:770:177. 

(5) The application of the method of least squares is
discussed at p. 327; C; 21.
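
Method (1) may be illustrated by the following sketch for three equidistant ordinates. It rests on assumptions not taken from the text: the logistic is written here as $P = L/(1+\beta e^{-k(t-t_0)})$ with a zero lower asymptote (which may differ in notation from the forms (102) to (102b)), and the function name is hypothetical. The reciprocals 1/P then form a constant plus a geometric progression, so that three ordinates determine the three constants exactly.

```python
import math

def logistic_from_three_ordinates(interval, p0, p1, p2):
    """Constants (L, beta, k) of P = L / (1 + beta*exp(-k*(t - t0))) passing
    through three populations observed at t0, t0 + interval, t0 + 2*interval."""
    y0, y1, y2 = 1.0 / p0, 1.0 / p1, 1.0 / p2
    # 1/P = 1/L + (beta/L) * r**j at the j-th ordinate, with r = exp(-k*interval).
    inv_limit = (y0 * y2 - y1 * y1) / (y0 + y2 - 2.0 * y1)
    r = (y2 - y1) / (y1 - y0)
    limit = 1.0 / inv_limit
    beta = (y0 - inv_limit) * limit
    k = -math.log(r) / interval
    return limit, beta, k
```

The fitted upper limit L is very sensitive to the three ordinates chosen, which is one reason for the summation and least squares methods listed above.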
C; 20. The Meaning and Use of the “Weights” in forming the
“Normal Equations” in the Method of Least Squares

Although the “weight” is, in accordance with its definition
by (107) and (108), the reciprocal of the square of the probable 
error (to adopt here the use of X, rather than a or in accordance 
with the phraseology usually associated with the classical ex- 
planations of the method of least squares), it should be observed 
that it obviously may also be interpreted as expressing “the
number of observations of weight unity of which it is the
equivalent”. For, in forming the “normal equations” (110), if the
“observation equation”, when $x=1$ and the weight is $W_1$, were
merely written down $W_1$ times, each time with weight unity,
and if in general the observation equation for $x=s$ is merely
written down $W_s$ times, each time with unit weight, it will be
clear that we obtain exactly the same scheme of equations as if
when $x=1$ the observation equation is merely multiplied by $W_1$
and the observation equation when $x=s$ is multiplied by $W_s$;
and the partial differential coefficients, when equated to 0, will 
give exactly the set of normal equations (110). 


The student must here be warned to watch for an alternative
mode of stating the rule of formation of the normal equations,
since otherwise it may cause confusion. The condition of least
squares, by (109), is that $\sum[W_x(f'_x-f_x)^2]$ must be a minimum,
where by (108), again adopting here the classical use of the
probable error, $W_x = 1/\lambda_x^2$. This, however, can also be written that
$\sum[w_x^2(f'_x-f_x)^2]$ is to be a minimum, where $w_x^2 = W_x$,
which means that $\sum[\{w_x(f'_x-f_x)\}^2]$ must be minimized where
$w_x = 1/\lambda_x$. The partial differentiation with regard to the several
unknowns in the case of $f'_x = \alpha+\beta x+\gamma x^2+\ldots$ then leads to the
following (omitting the common factor 2):

$$
\begin{aligned}
\sum[w_x\{w_x(\alpha+\beta x+\gamma x^2+\ldots)-w_xf_x\}] &= 0\\
\sum[xw_x\{w_x(\alpha+\beta x+\gamma x^2+\ldots)-w_xf_x\}] &= 0 \qquad (110a)\\
\sum[x^2w_x\{w_x(\alpha+\beta x+\gamma x^2+\ldots)-w_xf_x\}] &= 0
\end{aligned}
$$

etc.

These are of course the same as the normal equations (110),
since $w_x^2 = W_x$. In order that they may be written down in the
form (110a), however, they evidently require the following slight
re-wording of the verbal rule previously given (in Chapter VIII):
“Set down the ‘observation equation’ $(\alpha+\beta x+\gamma x^2+\ldots)-f_x = 0$
for each value of x, and prepare it by multiplying it by the
square root of its weight, i.e., by $w_x = \sqrt{W_x} = 1/\lambda_x$. Form the normal
equation for the unknown $\alpha$ by then multiplying each observation
equation, thus prepared, by the coefficient of $\alpha$ in that prepared
equation, and adding the results; similarly form the normal
equation for $\beta$ by multiplying each prepared observation equation
by the coefficient of $\beta$ in that equation, and adding the results;
and so on”. It will be noted that in the verbal rule just given
each observation equation itself is first multiplied by the square
root of the weight, so that $\sqrt{W_x}$ $(=w_x)$ becomes part of the
coefficient of the unknown by which the equations are multiplied
in forming the normal equations. An example of this procedure
is given in H 4^:164.
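
The equivalence of the two modes of statement, and of the interpretation of an integral weight as a number of repetitions, may be checked numerically. The sketch below fits a straight line $\alpha+\beta x$ in three ways; the function names are hypothetical and the data in the usage lines are invented for illustration only.

```python
import math

def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a11*p + a12*q = b1, a21*p + a22*q = b2 by determinants."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det

def fit_line_system_110(xs, fs, weights):
    """Normal equations with each observation equation multiplied by W_x."""
    s0 = sum(weights)
    s1 = sum(w * x for w, x in zip(weights, xs))
    s2 = sum(w * x * x for w, x in zip(weights, xs))
    t0 = sum(w * f for w, f in zip(weights, fs))
    t1 = sum(w * x * f for w, x, f in zip(weights, xs, fs))
    return solve_2x2(s0, s1, s1, s2, t0, t1)

def fit_line_system_110a(xs, fs, weights):
    """Each observation equation is first 'prepared' by w_x = sqrt(W_x); the
    normal equations are then formed from the prepared coefficients."""
    prepared = [(math.sqrt(w), math.sqrt(w) * x, math.sqrt(w) * f)
                for w, x, f in zip(weights, xs, fs)]
    n11 = sum(ca * ca for ca, cb, rhs in prepared)
    n12 = sum(ca * cb for ca, cb, rhs in prepared)
    n22 = sum(cb * cb for ca, cb, rhs in prepared)
    r1 = sum(ca * rhs for ca, cb, rhs in prepared)
    r2 = sum(cb * rhs for ca, cb, rhs in prepared)
    return solve_2x2(n11, n12, n12, n22, r1, r2)

def fit_line_by_repetition(xs, fs, integer_weights):
    """Write each observation equation down W_x times with unit weight."""
    rx, rf = [], []
    for x, f, w in zip(xs, fs, integer_weights):
        rx.extend([x] * w)
        rf.extend([f] * w)
    return fit_line_system_110(rx, rf, [1] * len(rx))

xs = [1, 2, 3, 4, 5]
fs = [2.1, 3.9, 6.2, 7.8, 10.1]
W = [1, 4, 2, 3, 1]
alpha, beta = fit_line_system_110(xs, fs, W)
```

All three functions return the same pair of constants, apart from rounding.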

It is important for the student to grasp thoroughly the
difference between the systems (110) and (110a), and also to
understand that in both systems the “weight” is $W_x = 1/\lambda_x^2$, so
that $w_x = 1/\lambda_x$ is the square root of the weight. Actuarial
students, in particular, should note that (contrary to the above
accepted definitions of the method of least squares) in P:Jjf:119
and 129, and P:1 7^:46, the weight is defined as $1/\lambda_x$, so that in
reality the term “weight” is there used for $w_x$ instead of for the
usual $W_x$.
C; 21. The Practical Application of the Method of Least Squares

The Systematic Solution of Linear Normal Equations 

A large proportion of the literature and text-book descriptions
of the method of least squares concerns itself with the systematic
solution of the linear “normal equations” and the controls of
the necessary calculations. As the equations to be solved
constitute a linear simultaneous set, the solutions can be expressed
very conveniently by means of determinants (see P:M5:231).
In the explanations two special notations are generally employed:
(a) The term “residual”, or “residual error”, with the symbol $v_x$,
is used for the difference between the fitted and observed values,
and (b) following Gauss (cf. P:P0:49), square brackets
are employed to denote summation, in the form
$[aa] = a_1^2+a_2^2+\ldots+a_n^2$, $[ab] = a_1b_1+a_2b_2+\ldots+a_nb_n$, etc. It may also
be noted that in many of the classical texts the unknowns are
denoted by x, y, z, . . . .

Since these expositions of systematic methods are so complete 
(with many numerical examples) and are so readily available, it 
will be adequate here to give the following references only: 

(1) For the “method of determinants” see V:155:2i\ and
P:^S:99.

(2) Gauss’s “method of substitution” (H:f7:Supplementum)
is fully explained in P;Jf55:234, P:i5:90, and P:P0:176.

(3) “Doolittle’s method” (ii:56), which is a modification of
Gauss’s, is described in P:^S:96 and P:;^5:104 et seq.

(4) For the case when the number of unknowns is large, the
“method of successive approximation (iteration)”
developed by Gauss, Seidel, and Jacobi is given in V:166 :255.

(5) Other variations are noted in P;/55:236 and P.^8:106.

(6) An electrical machine has recently been devised for the
automatic solution of large sets of linear simultaneous
equations (¥1:179).

Some of these processes are particularly convenient on account 
of the manner in which they facilitate the calculation of the 
“standard errors” of the unknown parameters as well as the
parameters themselves (for examples, see P:1 55:209 et seq., 
P:iS:103, P:1S4:U4: et seq., and P:I88). 
The Determination of the Constants in Makeham's Formula (83)

One of the main curve-fitting problems with which actuaries
are concerned is the determination of the constants in
Makeham’s formula (83). For the purposes of this discussion (83)
will be written $y_x = A+Bc^x$ where $y_x$ represents any one of the
functions $\mu_x$, $m_x$, or colog $p_x$, and the unknowns A, B, and c
may therefore be supposed to stand for the Makeham constants
in any of the formulae for those several functions. [This
Makeham constant c, which is used here in conformity with universal
practice, of course has nothing to do with the c of the Normal
Curve (11) and of the weight $W_x$ defined by (107).] Then, by
(109), the method of least squares requires that

$$\sum[W_x\{(A+Bc^x)-f_x\}^2] \text{ must be a minimum} \qquad (128)$$

and the “normal equations” resulting from the partial
differentiations with regard to A, B, and c are

$$
\begin{aligned}
A(\sum W_x)+B(\sum c^xW_x)-\sum f_xW_x &= 0\\
A(\sum c^xW_x)+B(\sum c^{2x}W_x)-\sum f_xc^xW_x &= 0 \qquad (129)\\
A(\sum xc^xW_x)+B(\sum xc^{2x}W_x)-\sum f_xxc^xW_x &= 0
\end{aligned}
$$

These equations are not linear with regard to the Makeham 
constant c; it is therefore necessary either to adopt the method 
of approximation stated on p. 94 and at p. 241; B; 22, or to em- 
ploy some other device in order to reach linear equations for 
solution. 

The classical method of approximation seems to have been 
illustrated first by Chandler in H:44:169. Later Karup also 
gave a thorough application (H:^^). 

Since the non-linear nature of the normal equations is due to 
the manner in which c is involved, and as $\log_{10}c$ is often found to lie
between about .038 and .04, another obvious procedure is to 
adopt a succession of trial values for c, and to determine the 
corresponding values of A and B from the first two of the 
equations (129). In order, however, to reach the absolute 
minimum for (128) it would be necessary to adopt some criterion 
for the degree of approximation secured, as observed by Ros- 
manith (H:f 0^:329). 

A very valuable method which is less onerous than the
preceding has been proposed by Steffensen (P:i5^). He eliminates
A and B from (129) by determinants, and obtains a single
equation in c, namely

$$
D \equiv \begin{vmatrix}
\sum W_x & \sum c^xW_x & \sum f_xW_x\\
\sum c^xW_x & \sum c^{2x}W_x & \sum f_xc^xW_x\\
\sum xc^xW_x & \sum xc^{2x}W_x & \sum f_xxc^xW_x
\end{vmatrix} = 0 \qquad (130)
$$

Since log c = .04 is usually a close approximation, the numerical 
solution of this equation can generally be effected (to 5 figures 
for c) from 3 trial values for log c and subsequent interpolation 
to find the value for which D=0. With c thus determined, 
A and B follow from any two of the equations (129). An example 
is given in PilSd, and the excellent results obtained are shown 
by the comparisons in P:S3:253. 
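
The determinant method may be sketched as follows. This is an illustration under assumptions rather than a transcription of the worked example cited: the function names are hypothetical, ages are measured from a central origin purely for numerical stability (which rescales B but leaves the root in c unchanged), and a bisection between two trial values of $\log_{10} c$ stands in for the hand interpolation from three trial values described above.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def normal_equation_sums(ages, f, w, c, origin):
    """Sums forming the three normal equations, with u = age - origin."""
    rows = [[0.0] * 3 for _ in range(3)]
    for x, fx, wx in zip(ages, f, w):
        u = x - origin
        cu = c ** u
        multipliers = (1.0, cu, u * cu)   # one multiplier per normal equation
        columns = (1.0, cu, fx)           # coefficient of A, of B, and the f term
        for i in range(3):
            for j in range(3):
                rows[i][j] += wx * multipliers[i] * columns[j]
    return rows

def fit_makeham(ages, f, w, log_c_low, log_c_high, steps=80):
    """Return (A, B, c) making D = 0, searching between two trial values of log10 c."""
    origin = sum(ages) / len(ages)

    def d(log_c):
        return det3(normal_equation_sums(ages, f, w, 10.0 ** log_c, origin))

    d_low, d_high = d(log_c_low), d(log_c_high)
    if d_low * d_high > 0.0:
        raise ValueError("D does not change sign between the trial values")
    for _ in range(steps):
        log_c_mid = 0.5 * (log_c_low + log_c_high)
        d_mid = d(log_c_mid)
        if d_low * d_mid <= 0.0:
            log_c_high = log_c_mid
        else:
            log_c_low, d_low = log_c_mid, d_mid
    c = 10.0 ** (0.5 * (log_c_low + log_c_high))
    # With c fixed, A and B follow from the first two normal equations.
    r = normal_equation_sums(ages, f, w, c, origin)
    det2 = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    a = (r[0][2] * r[1][1] - r[1][2] * r[0][1]) / det2
    b_at_origin = (r[0][0] * r[1][2] - r[1][0] * r[0][2]) / det2
    return a, b_at_origin / c ** origin, c
```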


The “weights” in all the preceding methods will be taken in 
accordance with formulae (111), (112), and (113), or the other 
particular applications to mortality functions discussed at 

jo' 

p. 272; C; 7. The weight of the observed ml as — — 

Wa;(l— mx) 

was thus used in Chandler’s graduation in H:^4-164, or may 
be taken as 


(Wx)2 


as suggested in P:5i:100. The weight of the 


observed colog pl^ being — was used in H:1(1S:256, and in 

l^^filpiaisen’s least squares graduation of P:185, Although it is 
not usually required for a Makeham graduation, the weight of
the observed $q_x$ would similarly be $\dfrac{E_x}{p_xq_x}$.


It must also be remembered, as pointed out in the text, that
if the series $f_x$ is to be fitted on the assumption of uniform weights,
but is itself transformed by a device such as the taking of its
logarithm, then that logarithmic series log $f_x$ must have weights
$(f_x)^2$ assigned to it. This would be of importance if the fitting
of such generalized Makeham forms for $\mu_x$ as (86)-(88) were to
be undertaken (as suggested in P:144’287) by fitting log $\mu_x$ by
the method of least squares.
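
The reason for the weights $(f_x)^2$ may be indicated briefly by the following first-order argument, offered here as a sketch and not as the demonstration given in the text. A small error $\delta f_x$ in $f_x$ produces in its logarithm the error

```latex
\delta(\log f_x) \approx \frac{\delta f_x}{f_x},
\qquad\text{so that}\qquad
\sum (f'_x - f_x)^2 \approx \sum f_x^{2}\,(\log f'_x - \log f_x)^2 .
```

Minimizing the unweighted sum of squares for $f_x$ is therefore approximately the same as minimizing the sum of squares for log $f_x$ with weights $(f_x)^2$.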
The fitting of Makeham’s formula (83) in its $l_x$ or log $l_x$
form evidently involves, with any method, more awkward
expressions than those which emerge from $\mu_x$, $m_{x+\frac12}$, or colog $p_x$.
It may be noted, nevertheless, that the fitting of log $l_x$, or $\sum\log l_x$
in decennial groups, by least squares (but in each case without
weights) has been discussed in H:120:231.

The Determination of the Constants of the Logistic Curve 

The fitting by the method of least squares of the
transcendental “logistic” curve has been discussed in several forms. If
it be assumed in (102) that the enumerated populations (here
$P_t$) are equally well determined, so that the weights of the several
observations can be taken as uniform, the least squares condition
requires that

must be a minimum. . . . (102d) 

The necessary differentiations with respect to the unknowns
here lead to non-linear “normal equations”. The classical
method of approximation for such a case (as stated on p. 94 and
at p. 241; B; 22) by which preliminary values are first found and
corrections then determined by the method of least squares, is
set out fully by Schultz in P:l;^4’164. The problem has also
been examined in a valuable paper by Wilson and Puffer (P:159;
see also P:d^:108), where it is observed that disappointing
results may arise from the neglect of terms of the second order
in the method of approximation as ordinarily used.

As in the case of Steffensen’s method for dealing with the
non-linear normal equations of the Makeham function, a useful
least squares technique, which involves a comparatively small
amount of work, has been evolved by Cramér and Wold (P:2S:
201). In (102), crude approximate values of the asymptotes
A and B are first determined by graphical inspection of the data.
The approximate position of the inflexion point, r, in this
symmetrical curve then follows from $P = \dfrac{A+B}{2}$. For we have,

from (102) by differentiation, , whence 

dt B-A 
Next regarding r and k as fixed, they

l-r \ 4 / 

determine the equations of condition for A and B to satisfy the 
usual unweighted least squares criterion. Using then the fixed 
r, but different values of k near its approximate value, they find 
by interpolation the value of k for the fixed r which produces 
the minimum. Repeating the process for a number (generally
only 2) of further values of r, it is possible to find with sufficient
accuracy the value of r for the required absolute minimum, and
thence k, B, and A by interpolation. A numerical example is
shown, and tables are supplied to facilitate the process. 

Another method of introducing the least squares principle, 
with the logistic in the form (102a), was used by Pearl and Reed 
(P:^^:579 et al.; P:1^4AQS; and P:;^7:245). First determining 
from three equidistant points an approximate value for the 
exponent of e, they multiply through by the denominator of the 
expression to be minimized, and then take the partial derivatives 
therefrom in order to make the unweighted sum of the squares 
of the resulting residuals a minimum. This process, however,
even though it may give good results, is not the true least
squares procedure. The minimizing of (102d) is based on the
assumption of uniform weights for $P_t$; but the multiplication
through by the denominator changes the system entirely, and
introduces implicitly a series of different weights, so that, as
Schultz observes (P:ij^4:164), “in fact it is difficult to give
meaning to the residuals which they are minimizing” (cf. also
p. 96 here). 



C; 22. Practical Applications of the Unweighted Method of 
Moments in Actuarial Work 

The relationship between the method of least squares and the 
method of moments which is discussed in the text has an im- 
portant bearing upon the extent to which the method of moments 
should be applied in practice — as is very commonly done — in its 
unweighted form. 

The excellent results obtainable by the unweighted method of
moments in the fitting of curves, such as Pearson’s, to frequency
distributions are to be anticipated from the principles brought
out in that discussion. The application of the unweighted
moment equations in the fitting of curves to series (such as
or colog $p_x$) which cannot be viewed as frequency distributions
should, however, be undertaken with some caution; the entire
omission of the weights clearly may lead to a system of fitting
essentially different from that of strict least squares, while
certain alterations in the method of stating the basic equations
may result, in effect, in the use of approximate weights which
are in reasonable conformity with those of least squares. These
considerations are of particular importance in connection with
some of the procedures by which it is said that the “method of
moments” (without any reference to whether weights are or
are not used) has been applied in the fitting of Makeham’s
formula (83).

One example of the use of unweighted moments is the
application of the moment principle directly to the graduation by
Makeham’s form colog $p_x = \alpha+\beta c^x$, without any introduction of
weights, in H:5;S:11, and the valuable discussions and detailed
illustrations of that procedure in 11\1S5 and 1?\1S6 (see also
P:f6;?:547). The mode of computation followed is that of
successive summation, so that the equations to be solved for
$\alpha$, $\beta$, and c are the equations for $\Sigma$ colog $p_x$, $\Sigma^2$ colog $p_x$, and
$\Sigma^3$ colog $p_x$ over the range of ages selected (cf. p. 257; B; 27, and
see H:7S5^:656 and P:f 50:3-6 for the precise development of the
equations in this case). Since the method of course leads (just
as does the method of least squares) to an awkward equation for
the determination of c, it should be noted that Steffensen
(P:JfS0:7) performs the solution for c by elimination with
determinants and subsequent interpolation from trial values
(cf. his similar method for least squares stated at equation (130)
on p. 326; C; 21 here). Trachtenberg (H :1S6) also gives a useful
table by which the solution by trial of the equation for c may be
accomplished when the range of ages is from 20 to 79 inclusive.
Steffensen’s careful numerical application of the method to the
Danish experience, which follows Makeham’s formula
closely (see F:1S6:2), shows that the results are not very
satisfactory (P:fS^:16), and that the failure can be attributed to the
manner in which this direct application of the method of
moments to colog $p_x$ ignores the weights of the observations.

The unsatisfactory results which are thus to be anticipated
from colog $p_x = \alpha+\beta c^x$ when the constants are obtained directly
from the unweighted moment equations are greatly improved
(as again is to be expected from the principles stated in the text)
when the equation just given is applied in a form which in effect
deals, by unweighted moments, with a frequency distribution
instead of with a series of rates. For colog $p_x \doteq \mu_{x+\frac12} \doteq \dfrac{\theta_x}{E_{x+\frac12}}$;
if for this, by (83), we use the Makeham form $A+Bc^{x+\frac12}$ we
may write $E_{x+\frac12}(A+Bc^{x+\frac12}) = \theta_x$; we are then dealing with a
frequency distribution, of deaths $\theta_x$, instead of a series of rates
colog $p_x$, and (in accordance with the principles explained in the
text) may use unweighted moments as being likely to give results
close to those obtainable from colog $p_x$ by strictly weighted least
squares. Sir G. F. Hardy made extensive use of this principle,
performing the calculations by his method of successive
summations as the equivalent of the method of moments, and employing
trial values of c so that only the first and second summations
are required in order to determine A and B from two linear
equations for each such value of c (see P:5f :65 and 0^:292).
[In some applications Hardy also subsequently introduced an
adjustment to correct for the approximate nature of the relation
colog $p_x \doteq \mu_{x+\frac12}$, so that the constants $\alpha$ and $\beta$ for colog $p_x$ could
be derived from the graduation as well as A and B in $\mu_{x+\frac12}$
(see H :95 :502 and H :1 06:293).] In the case of the Danish
experience (where, as already noted, unweighted moments applied
directly to colog $p_x$ gave unsatisfactory results) this procedure of
Hardy (which transforms the process into an application of
unweighted moments to a frequency distribution) showed, as
would be expected, results very close to those of strictly weighted
least squares (see methods Nos. 1 and 2 in P:SS:253).

The methods of the preceding paragraphs contemplate the
use of Makeham’s formula in its colog $p_x$ or $\mu_{x+\frac12}$ form. The
expression $\log l_x = \log k + x\log s + c^x\log g$ will evidently be less
easily handled; it may be noted, nevertheless, that the
application of unweighted moments by integration in this case was
worked out fully by Karl Pearson in H:S^:298, and has been
used by Glover (H;^4) and Thompson (H:l;^0:226), while
a simpler method involving summation instead of integration
is given at P:5^:95.

C; 23. The Practical Application of the Minimum-$\chi^2$ Method
to Actuarial Data

An interesting illustration of the principle of minimizing $\chi^2$
in accordance with condition (122) is shown by Cramér and Wold,
in P:;^5:173, for the case of a Makeham graduation. Remembering
that the method is applicable particularly when the data
are in the form of a frequency distribution, and that the observed
deaths, $\theta_x$, may be so considered, it will evidently be advisable
to make, in effect, a graduation of $\theta_x$, for a ratio such as
$m_{x+\frac12} = \dfrac{\theta_x}{E_{x+\frac12}}$, which by Makeham’s formula (83) may be taken in
the form $A+Bc^{x+\frac12}$, is not a frequency distribution. We should
therefore deal with the frequency distribution of deaths,
$\theta_x = E_{x+\frac12}(A+Bc^{x+\frac12})$; in (122), consequently, we put and
and the expression to be minimized 

becomes j , where A, B, and c are 

the unknowns. 

Since here, as in the method of least squares, the differentia- 
tion with regard to c leads to a non-linear equation for solution, 
it will be advisable to adopt certain trial values for c, and thence 
to determine the best value by interpolation (cf. Steffensen's 
method, p. 326; C; 21, and P:^5:176). For each such value of 
c, the differentiations with regard to A and B when equated to 
zero give easily (dropping the multiplier 2) 

as shown, in slightly different notation, in P:^5:173. 
The numerical illustrations given by Cramér and Wold
indicate, as would be expected (cf. P:1 57:358, quoted on p. 99
here), that the values of $c$, $A$, and $B$ obtained by this method are
insignificantly different from those given by the corresponding
equations of the unweighted method of moments, when the latter
also is applied (see p. 330; C; 22) to graduate the frequency
distribution of deaths in the form $E_{x+\frac12}(A + Bc^{x+\frac12})$.
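
The trial-value procedure described above can be sketched numerically. The following is a minimal illustration only, not Cramér and Wold's own computation. It assumes that the quantity minimized is the sum of (observed deaths − E(A + Bc^(x+½)))² divided by the observed deaths; that form is an assumption made here because it renders the equations for A and B linear for each trial c, and it should be checked against condition (122). The exposures, deaths, and constants are invented, and a simple search over a grid of trial values stands in for the interpolation between trial values of c.

```python
def fit_A_B(ages, exposed, deaths, c):
    """Solve the two linear equations for A and B at a trial value of c.

    Assumed objective (hypothetical, not quoted from the source):
    sum over ages of (d - E*(A + B*c**(x + 0.5)))**2 / d.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for x, E, d in zip(ages, exposed, deaths):
        u = E                      # coefficient of A
        v = E * c ** (x + 0.5)     # coefficient of B
        w = 1.0 / d                # weight: reciprocal of observed deaths
        s11 += w * u * u
        s12 += w * u * v
        s22 += w * v * v
        t1 += w * u * d
        t2 += w * v * d
    det = s11 * s22 - s12 * s12
    A = (t1 * s22 - t2 * s12) / det
    B = (s11 * t2 - s12 * t1) / det
    return A, B


def objective(ages, exposed, deaths, A, B, c):
    """Value of the assumed objective for given A, B, c."""
    return sum((d - E * (A + B * c ** (x + 0.5))) ** 2 / d
               for x, E, d in zip(ages, exposed, deaths))


def fit_makeham(ages, exposed, deaths, trial_cs):
    """Try each trial c, fit A and B linearly, and keep the smallest objective."""
    best = None
    for c in trial_cs:
        A, B = fit_A_B(ages, exposed, deaths, c)
        value = objective(ages, exposed, deaths, A, B, c)
        if best is None or value < best[0]:
            best = (value, A, B, c)
    return best


# Invented data generated from known constants, so the fit can be checked.
TRUE_A, TRUE_B, TRUE_C = 0.002, 0.00005, 1.10
ages = list(range(30, 80, 5))
exposed = [10000.0] * len(ages)
deaths = [E * (TRUE_A + TRUE_B * TRUE_C ** (x + 0.5))
          for x, E in zip(ages, exposed)]

trial_cs = [1.06 + 0.01 * i for i in range(9)]   # 1.06, 1.07, ..., 1.14
chi2_min, A_hat, B_hat, c_hat = fit_makeham(ages, exposed, deaths, trial_cs)
print(A_hat, B_hat, c_hat, chi2_min)
```

With real data the minimum would not fall exactly on a trial value, and interpolation between the neighbouring trial values of c, as the text suggests, would refine the estimate.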


C;24. Applications of the Tests of Goodness of Fit based on 
(i) Changes of Sign, and (ii) Standard Deviations 

(i) The following example of the formula
$\frac{N-r-1}{2^{r+1}} \pm$ probable error (if it may be assumed that the
occurrences are independent) for a non-periodic series with the
first and last signs omitted (as given at p. 249; B; 25), was shown
by De Forest (H:^5:34) in order to test the deviations between
certain observed and graduated rates of mortality over a range of
70 ages:


Number of Like        Expected Sequences           Observed
Signs in Sequence     $\frac{N-r-1}{2^{r+1}}$      Sequences
= r

      1               17.0 ± 2.4                      26
      2                8.4 ± 1.8                      10
      3                4.1 ± 1.3                       6
      4                2.0 ± 0.9                       1
      5                1.0 ± 0.7                       0
      6                 .5 ± 0.5                       0
      7                 .2 ± 0.3                       0


Here the excess in the observed sequences for r = 1, 2, and 3, and
the deficiencies thereafter, indicate that the graduated values
follow the observations rather too closely, i.e., the original series
has not been smoothed quite enough. This may be seen also
from the fact that, even if adjustment were made in respect of
the first and last signs in order to deal with the series as a periodic
one, the signs falling within groups of 1 or 2 would total about 36,
and are thus much greater than (instead of being approximately
equal to) the 7 in groups of more than 2. The sequences of all
orders, being 43, are similarly too large in comparison with the
limit $N/2 = 35$ even if the series were periodic.

Two examples of the similar method based on $N/2$ (but without
knowledge of De Forest's work) may be found in H:^^:331
and F:135:5 (see also a reference thereto in P:^;^J:28).
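
The expected numbers of sequences in De Forest's table can be reproduced directly from the formula (N − r − 1)/2^(r+1) with N = 70. The sketch below does so. The expression used for the probable error, 0.6745·sqrt(e·(1 − 2^−(r+1))) with e the expected number, is an assumption adopted here because it reproduces the tabulated "±" figures; it is not quoted from this section.

```python
import math


def expected_sequences(n_signs, r):
    """Expected number of sequences of exactly r like signs (non-periodic series)."""
    return (n_signs - r - 1) / 2 ** (r + 1)


def probable_error(n_signs, r):
    """Assumed binomial-type probable error of the expected number (hypothetical form)."""
    e = expected_sequences(n_signs, r)
    return 0.6745 * math.sqrt(e * (1 - 2 ** -(r + 1)))


N = 70
observed = {1: 26, 2: 10, 3: 6, 4: 1, 5: 0, 6: 0, 7: 0}

rows = [(r, expected_sequences(N, r), probable_error(N, r), obs)
        for r, obs in observed.items()]
for r, e, pe, obs in rows:
    print(f"r={r}: expected {e:.1f} ± {pe:.1f}, observed {obs}")
```

Comparing each observed count with its expected value and probable error shows the excess of short sequences on which the text's conclusion rests.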

(ii) In applying the test for a permissible graduation
based on $\pm 3\sigma$ or $\pm 2\sigma$ ($\pm 4\lambda$ or $\pm 3\lambda$), or the test for a
satisfactory graduation based on $\tfrac23\sigma$, it is necessary only to
compute $\sigma = \sqrt{p''_x q''_x / E'_x}$ at each age (where $E'_x$ denotes the
exposed to risk, $q'_x$ the ungraduated rate of mortality, and $q''_x$ the
graduated rate; see par. (iii); C; 7), and then to compare the actual
deviation $q'_x - q''_x$ with $3\sigma$ or $2\sigma$, or with $\tfrac23\sigma$, as the case may be.

A numerical example of the $2\sigma$ method is shown at H:160:12
and 25.

Alternatively, of course, the comparisons may be shown for
the deaths instead of for the rates of mortality. The standard
deviation to be used is then $\sigma[\theta'_x] = \sqrt{E'_x p''_x q''_x}$ by formula (8)
and par. (ii); C; 7, and the actual deviations to be compared
with $3\sigma$ or $2\sigma$, or with $\tfrac23\sigma$, are $(\theta'_x - \theta''_x) = E'_x(q'_x - q''_x)$.
Illustrations of the $\tfrac23\sigma$ test are given in H:90:128 and 155. The
examination is there made in 5-year age groups. It should be noted,
however, that this customary method of using age groups may
give misleading conclusions through the cancellation of positive
and negative deviations within a group (cf. P:125:Q-7); it is
therefore advisable in practice to make the comparisons age by
age.
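
The comparison on the basis of deaths can be sketched as follows. The function computes sqrt(E·p''·q'') from the exposed to risk and the graduated deaths and expresses the actual deviation as a multiple of it. The three rows are taken from Table E of C; 25 purely as convenient figures; they are 5-year groups, so the caution in the text about grouping applies, and the function and variable names are illustrative only.

```python
import math


def deviation_in_sigmas(exposed, observed_deaths, graduated_deaths):
    """Return (deviation, sigma, |deviation| / sigma) for one age or age group.

    sigma is sqrt(E * p'' * q''), with q'' = graduated deaths / exposed.
    """
    q = graduated_deaths / exposed
    sigma = math.sqrt(exposed * q * (1.0 - q))
    deviation = observed_deaths - graduated_deaths
    return deviation, sigma, abs(deviation) / sigma


# (age group, exposed to risk, observed deaths, graduated deaths) from Table E
rows = [
    ("20-24", 1749, 15, 12),
    ("40-44", 588003, 6461, 6300),
    ("85-89", 18059, 3946, 4131),
]

results = {}
for label, E, obs, grad in rows:
    dev, sigma, ratio = deviation_in_sigmas(E, obs, grad)
    results[label] = (dev, sigma, ratio)
    print(f"{label}: deviation {dev:+d}, sigma {sigma:.1f}, ratio {ratio:.2f}")
```

A ratio above 2 or 3 marks a deviation outside the 2σ or 3σ limit; a ratio below about 0.67 marks one inside the probable error.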

C; 25. Illustrations of the $\chi^2$ Test

(1) The Binomial; Probabilities Known a priori; One Con- 
straint — Weldon's Dice Data 

A classical example which brings out clearly the technique
of the $\chi^2$ test, and also its relation to the simpler probable error
methods, is afforded by the results of Weldon's experiment in
throwing 12 dice together, 26,306 times, and observing the
number at each throw which turned up 5 or 6. With one
unbiassed die the chance of 5 or 6 at a single throw is $\tfrac13$; the
theoretical frequencies when 12 unbiassed dice are thrown
together 26,306 times are therefore the terms of the binomial
$26306(\tfrac13 + \tfrac23)^{12}$, which are shown in column (2) of Table A.
Weldon, however, found that the distribution which occurred
was that shown in column (3). Knowing thus the "true"
distribution to be expected from unbiassed dice, and having a
series actually observed, the question which immediately arises
is whether the observed series fits the theoretical series within
such deviations as may be attributable to chance alone, or, on
the other hand, whether it is incompatible with this hypothesis
so that some cause may have operated to produce the results;
as, for example, that all the dice may not have been unbiassed.

TABLE A 


Number of Dice   Theoretical Frequency,   Observed      $\frac{(f'_r - f_r)^2}{f_r}$
with 5 or 6      $f_r$, Unbiassed Dice    Frequency,
$r$              ($p = \tfrac13$)         $f'_r$
    (1)               (2)                    (3)              (4)

     0              202.75                   185            1.554
     1             1216.50                  1149            3.745
     2             3345.37                  3265            1.931
     3             5575.61                  5475            1.815
     4             6272.56                  6114            4.008
     5             5018.05                  5194            6.169
     6             2927.20                  3067            6.677
     7             1254.51                  1331            4.664
     8              392.04                   403             .306
     9               87.12                   105            3.670
    10               13.07                    14 ]
    11                1.19                     4 ]           .952
    12                 .05                     0 ]

  Total           26306.02                 26306    $\chi_0^2 = 35.491$

In applying the $\chi^2$ test to this enquiry we find $\chi_0^2$ as in column
(4), the last two groups being merged with that for r = 10 to give
frequencies not less than 10 throughout. There are thus 11
groups; one constraint has been imposed by the equality of the
totals, as in the fundamental derivation of (50) and (52), so
that there are 11 − 1 = 10 "degrees of freedom"; and from the
tables of $P$ and $\chi_0^2$ it is seen that for 10 degrees of freedom $P$ lies
well below .01 (being, in fact, about .0004) when $\chi_0^2 = 35.491$.
It is therefore evident that the "fit" is very poor, i.e., the
observed series is not compatible with the hypothesis that the
true frequencies would be those of column (2) resulting from
unbiassed dice, since we should expect to get a set of deviations
giving a value of $\chi^2$ as large as or larger than that actually
observed only about 4 times in 10,000 trials. The inferences
to be drawn would therefore be (separately or in conjunction)
that the dice were biassed, or that the method of throwing them
was faulty (an unlikely circumstance, since special care was
taken), or that a most improbable event actually occurred.
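
The arithmetic of this first illustration can be retraced with a short sketch: it forms the binomial terms for p = 1/3, merges the classes r = 10, 11, 12, and sums (f' − f)²/f over the 11 groups. The function and variable names are illustrative only; the observed frequencies are those of column (3) of Table A.

```python
from math import comb

observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
n_dice, n_throws = 12, 26306


def binomial_frequencies(p, n, total):
    """Terms of total * (p + q)**n, one for each r = 0..n."""
    return [total * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)]


def grouped_chi_square(obs, exp, first_merged):
    """Merge classes from index first_merged onward, then sum (f' - f)^2 / f."""
    obs_g = obs[:first_merged] + [sum(obs[first_merged:])]
    exp_g = exp[:first_merged] + [sum(exp[first_merged:])]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_g, exp_g))
    return chi2, len(obs_g)


expected = binomial_frequencies(1 / 3, n_dice, n_throws)
chi2, groups = grouped_chi_square(observed, expected, first_merged=10)
degrees_of_freedom = groups - 1   # one constraint: equality of the totals

print(round(chi2, 2), groups, degrees_of_freedom)
```

The value of P itself is then read from tables of $\chi^2$, as in the text; the sketch stops at $\chi_0^2$ and the degrees of freedom.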


(2) The Binomial; Probabilities Estimated from the Data; Two 
Constraints — Weldon's Dice Data 
Suppose now that we are presented merely with the observed
frequencies of column (3), without any prior knowledge of the
true probabilities as reflected in the frequencies of column (2).
Or, as an alternative but really equivalent viewpoint, suppose
that, being presented only with the observed frequencies, we
are asked to examine whether they are consistent with the
hypothesis that the true probability of throwing 5 or 6 with
a particular die is not the unbiassed estimate $\tfrac13$ assumed above,
but is instead a value to be estimated from the data. If we adopt
as the best estimate the weighted mean

$$\frac{\sum r f'_r}{12(26306)} = \frac{106602}{315672} = .3376986,$$

the problem becomes that of comparing the "goodness
of fit" of the observed series shown again in column (2) of Table
B, with the theoretical values resulting from the distribution
$26306(.337699 + .662301)^{12}$, which are shown in column (3). The
value of $\chi_0^2$ is then found, as in column (4), to be 8.179.

TABLE B


Number of Dice   Observed      Estimated Frequency,   $\frac{(f'_r - f_r)^2}{f_r}$
with 5 or 6      Frequency,    $f_r$
$r$              $f'_r$        ($p = .3376986$)
    (1)             (2)              (3)                   (4)

     0              185            187.38                  .030
     1             1149           1146.51                  .005
     2             3265           3215.24                  .770
     3             5475           5464.70                  .019
     4             6114           6269.35                 3.849
     5             5194           5114.65                 1.231
     6             3067           3042.54                  .197
     7             1331           1329.73                  .001
     8              403            423.76                 1.017
     9              105             96.03                  .838
    10               14             14.69 ]
    11                4              1.36 ]                .222
    12                0               .06 ]

  Total           26306                          $\chi_0^2 = 8.179$


The last three frequencies being grouped, as before, there are
still 11 groups. It is to be noted especially, however, that there
is now an additional "constraint" (making 2 in all) because $p$
has been made the same in the observed and theoretical series,
in addition to the equality of the totals. The degrees of freedom
are consequently only 11 − 2 = 9. From the tables it is seen that
$P$ is .52 when d = 9 and $\chi_0^2 = 8.179$. The "fit" is therefore good,
i.e., there is, on this test, no reason to doubt the hypothesis that
the probability for each die of throwing 5 or 6 was actually the
biassed figure .3376986.
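
The second illustration differs from the first only in that p is estimated from the data, which costs one further degree of freedom. A brief sketch under the same conventions (names illustrative; observed frequencies from column (2) of Table B):

```python
from math import comb

observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
n_dice = 12
n_throws = sum(observed)

# Weighted mean: total number of dice showing 5 or 6, over all dice thrown.
p_hat = sum(r * f for r, f in enumerate(observed)) / (n_dice * n_throws)

estimated = [n_throws * comb(n_dice, r) * p_hat ** r * (1 - p_hat) ** (n_dice - r)
             for r in range(n_dice + 1)]

# Merge r = 10, 11, 12 into one group, as in the text.
obs_g = observed[:10] + [sum(observed[10:])]
est_g = estimated[:10] + [sum(estimated[10:])]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_g, est_g))

constraints = 2   # equality of totals, and p fitted from the data
degrees_of_freedom = len(obs_g) - constraints

print(round(p_hat, 7), round(chi2, 3), degrees_of_freedom)
```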


(3) Frequency Curves — Number of Constraints 
It will be useful in connection with the concept of "degrees
of freedom" to point out that the observed frequencies, $f'_r$, of data
such as Weldon's could be represented, or "graduated", by some
analytical function, and that if then it were desired to examine
the goodness of fit by the $\chi^2$ test, it would be accomplished by
computing $\chi_0^2$ as in the preceding illustrations with, however,
due allowance for the proper number of constraints. Thus if 11
groups were still used, and Poisson's formula were fitted, there
would be 11 − 2 = 9 degrees of freedom, since in determining the
graduated $f_r$ two constraints result from equating the total
number and the mean; the Normal Curve, fitted from the total,
mean, and standard deviation, imposes 3 constraints; Pearson's
Type III is fitted by using the total, the mean, $\mu_2$, and $\mu_3$, so that
there are 4 constraints; Pearson's Main Types I, IV, and VI
require $\mu_4$ as well, thus increasing the number of constraints to 5;
and the Gram-Charlier series similarly employs the total and
the first four moments, with 5 constraints again.


(4) The Graduation of Mortality Statistics; Makeham's 
Formula; Number of Constraints when c is Assumed 
The problem of graduating mortality statistics does not
ordinarily present itself to the actuary as a mere graduation of a
series of observed deaths. The data usually appear in the form
of a column of exposed to risk, $E'_x$, at each age $x$ (which have
been obtained by observing the exposures of a group of lives
with due allowances for the fluctuations caused by new entrants
and exitants) and another column of the corresponding observed
deaths, $\theta'_x$, at each age. This means that for each age $x$ there
are, in effect, $E'_x$ cases, of which $\theta'_x$ are observed to die, and
$E'_x - \theta'_x$ do not die. A graduation of the observed rate of
mortality $q'_x \left(= \theta'_x / E'_x\right)$, or a function such as
$\operatorname{colog}_{10} p_x$ or […], is then generally made. If it is next
desired to test the fit obtained, the customary procedure is to
compute the expected deaths, say $\theta''_x$, by multiplying the original
$E'_x$ cases by the graduated rate of mortality, say $q''_x$, and then to
compare $\theta''_x$ with the observed $\theta'_x$ of the data. That is to say,
the original data in reality consist (taking two age groups only as
an example) of observations in the form set out in Table C, and a
graduation might have produced the adjusted figures of Table D if
it had been performed so that the total expected deaths,
$\sum \theta''_x$, equal the total actual deaths, $\sum \theta'_x$. This form of
presentation is known as a Contingency Table; in this particular
case we have a "2×2" contingency table, since there are (to the
left and above the double lines) 2 rows and 2 columns, giving
4 "cells" in all: a "fourfold" table.


TABLE C — Observed Values 


Age Group     $\theta'_x$     $E'_x - \theta'_x$     Total
$x$           die             do not die             $E'_x$

20-24            15              1,734                1,749
25-29           113             15,876               15,989

Total ....      128             17,610               17,738


TABLE D — Graduated Values 


Age Group     $\theta''_x$    $E'_x - \theta''_x$    Total
$x$           die             do not die             $E'_x$

20-24            12              1,737                1,749
25-29           116             15,873               15,989

Total ....      128             17,610               17,738


Now in such a table the $E'_x$ is fixed, i.e., the totals of the rows
are fixed; and this is necessarily so in both the tables. Similarly
when, as here, the graduation has made the total expected
deaths equal to the total of the actual deaths ($\sum \theta'_x$), it follows
that the totals of the columns are also equal in the two tables.
Under such circumstances, in a fourfold table, the fixing of one
cell fixes them all; for example, if $\theta''_x$ for the 20-24 group were
to have emerged as 30, all the rest of the table could be written
down at once from the fixed values of the totals of the two rows
and the first column. Since the value in only one cell can here
be assigned at will, it follows that there is only one "degree of
freedom". By similar reasoning it is easy to see, in general,
that in a $p \times q$ contingency table, with $pq$ cells, the frequencies
in the first $p - 1$ columns and $q - 1$ rows can be determined
at will, and the remainder follow automatically from the totals,
so that there are $(p - 1)(q - 1)$ degrees of freedom.
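
The statement that one cell fixes all the others can be checked mechanically. The sketch below fills a fourfold table from its fixed margins once the 20-24 deaths cell is chosen, using the margins of Tables C and D; the function names are illustrative only.

```python
def complete_fourfold(row_totals, first_column_total, top_left):
    """Fill a 2x2 table from fixed margins once the top-left cell is chosen."""
    r1, r2 = row_totals
    a = top_left                    # deaths, first age group (assigned at will)
    b = r1 - a                      # survivors, first age group
    c = first_column_total - a      # deaths, second age group
    d = r2 - c                      # survivors, second age group
    return [[a, b], [c, d]]


def degrees_of_freedom(p, q):
    """Degrees of freedom of a p x q contingency table with fixed margins."""
    return (p - 1) * (q - 1)


row_totals = (1749, 15989)   # exposed to risk, ages 20-24 and 25-29
total_deaths = 128

table_if_30 = complete_fourfold(row_totals, total_deaths, 30)
table_d = complete_fourfold(row_totals, total_deaths, 12)

print(table_if_30)
print(table_d)
print(degrees_of_freedom(2, 2), degrees_of_freedom(2, 14))
```

Choosing 12 for the free cell reproduces Table D, and the 2 × 14 case gives the 13 degrees of freedom used for the extended example.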


Let us now consider a more extended hypothetical example
(taken from H:^0:155, with slight adjustment for illustrative
purposes here, in order to make the totals of the actual and
expected deaths precisely equal). Suppose, therefore, that the
data consist of the exposed to risk, $E'_x$, and actual deaths, $\theta'_x$,
of columns (2) and (3) of Table E, and that we wish to apply the
$\chi^2$ method to test the fit of the graduated values stated in the
second half of that table.

Now if the deaths had been obtained by some graduation
process which operated directly upon the observed deaths, $\theta'_x$,
alone, to produce simply the graduated values of column (6),
and in so doing merely imposed the condition that the totals of
$\theta'_x$ and $\theta''_x$ should be equal, the problem would be analogous to
TABLE E 



                   Observed Data                              Graduated
        ------------------------------------   ------------------------------------
Age,    Exposed      Deaths,    $E'_x-\theta'_x$,    Exposed      Deaths,    $E'_x-\theta''_x$,
$x$     to Risk,     $\theta'_x$      who do         to Risk,     $\theta''_x$     who do
        $E'_x$                  not die        $E'_x$                  not die
(1)       (2)         (3)         (4)            (5)         (6)         (7)

20-24     1,749         15       1,734           1,749         12       1,737
25-29    15,989        113      15,876          15,989        116      15,873
30-34   107,629        864     106,765         107,629        861     106,768
35-39   346,276      3,119     343,157         346,276      3,136     343,140
40-44   588,003      6,461     581,542         588,003      6,300     581,703
45-49   728,094      9,761     718,333         728,094      9,698     718,396
50-54   757,987     13,071     744,916         757,987     13,183     744,804
55-59   701,051     16,521     684,530         701,051     16,636     684,415
60-64   590,761     19,628     571,133         590,761     19,815     570,946
65-69   442,842     21,428     421,414         442,842     21,527     421,315
70-74   286,647     20,731     265,916         286,647     20,505     266,142
75-79   151,977     16,160     135,817         151,977     16,105     135,872
80-84    62,595     10,003      52,592          62,595      9,796      52,799
85-89    18,059      3,946      14,113          18,059      4,131      13,928

that already considered for Weldon's dice data, and there would
be one constraint, giving, with 14 groups, 14 − 1 = 13 degrees of
freedom. Then $\chi_1^2$ would be found as
$\sum \left[ \frac{(\theta'_x - \theta''_x)^2}{\theta''_x} \right] = 24.7582$,
as in column (3) of Table F. $P$ corresponding to $\chi_0^2 = 24.7582$
for 13 degrees of freedom is .027. This is below .05, and not
much above .02; there is consequently strong reason to conclude
that there is a real discrepancy between the observed deaths of
column (3) and the values shown in column (6), if they had been
obtained by direct graduation.

TABLE F 


Age      $(\theta'_x-\theta''_x)$   $\frac{(\theta'_x-\theta''_x)^2}{\theta''_x}$   $(E'_x-\theta'_x)-(E'_x-\theta''_x)$   $\frac{[\text{Col. (4)}]^2}{E'_x-\theta''_x}$
$x$
(1)          (2)              (3)                   (4)                       (5)

20-24          3             .7500                   -3                     .0052
25-29         -3             .0776                    3                     .0006
30-34          3             .0104                   -3                     .0001
35-39        -17             .0922                   17                     .0008
40-44        161            4.1144                 -161                     .0446
45-49         63             .4093                  -63                     .0055
50-54       -112             .9515                  112                     .0168
55-59       -115             .7950                  115                     .0193
60-64       -187            1.7648                  187                     .0612
65-69        -99             .4553                   99                     .0233
70-74        226            2.4909                 -226                     .1919
75-79         55             .1878                  -55                     .0223
80-84        207            4.3741                 -207                     .8115
85-89       -185            8.2849                  185                    2.4573

Total       ±718   $\chi_1^2 = 24.7582$            ∓718         $\chi_2^2 = 3.6604$


Suppose now, however, that the graduated $\theta''_x$, instead of
being obtained as above by direct adjustment of the observed
deaths only, have been computed from a graduation of some
mortality ratio derived from the data of Table E, such as […]
or […], in which both $\theta'_x$ and $E'_x$ are involved. Then the
position is that the observed data of both cols. (3) and (4) of
Table E have contributed to the graduation, and have emerged
from the process as cols. (6) and (7) of that table. The observed
data, in fact, constitute a contingency table with 2 columns,
$\theta'_x$ and $E'_x - \theta'_x$, and 14 rows, and we desire to test whether
the corresponding 2×14 graduated values in the second half of
the table represent a good "fit" by the $\chi^2$ test. In so doing the
number of constraints imposed by the method of graduation
employed must be taken into account. If, as on the previous
supposition, the method provided only for an equality in the
totals, the "degrees of freedom" for such a table would be
$(p - 1)(q - 1)$ where $p = 2$ and $q = 14$, or 13 still.

But suppose now (in order to deal with a case which occurs
prominently in actuarial practice) that the graduation had been
made by fitting Makeham's formula colog $p_x = \alpha + \beta c^x$, and that
the fitting had been performed by first choosing $c$ arbitrarily
(or assuming it from prior knowledge of other data) as $\log^{-1} .039$,
and by imposing an equality in the total frequencies, as before,
and now also in the first moments. The arbitrary selection of
$\log c$ as .039 means that $c$ has not been determined from the data,
and thus has not in any way imposed a constraint, so that no
degree of freedom should be deducted; the equality of totals is
allowed for in $(p - 1)(q - 1) = 13$; but now an additional constraint
results from the use of the first moments, so that the degrees of
freedom become 13 − 1 = 12. From a slightly different viewpoint,
there are 14 age groups; the formula actually fitted is $\alpha + \beta c^x$,
where $c$ is assumed arbitrarily, and two constants $\alpha$ and $\beta$ are
actually determined from the data by moments; there is
consequently no constraint in respect of $c$, but 2 for $\alpha$ and $\beta$; the
degrees of freedom are therefore 14 − (0 + 2) = 12. The
calculation of $\chi_0^2$ must in this case, since both $\theta'_x$ and
$E'_x - \theta'_x$ are involved, include the $E'_x - \theta'_x$ elements, as shown
in cols. (4) and (5) of Table F, which gives $\chi_0^2$ for both the
$\theta'_x$ and $E'_x - \theta'_x$ series as 24.7582 + 3.6604 = 28.4186. These
calculations, which give

$$\sum \frac{(\theta'_x - \theta''_x)^2}{\theta''_x} + \sum \frac{\left[(E'_x - \theta'_x) - (E'_x - \theta''_x)\right]^2}{E'_x - \theta''_x},$$

may of course also be made in the form

$$\sum \frac{(\theta'_x - \theta''_x)^2}{E'_x \, p''_x \, q''_x},$$

in accordance with the analysis on p. 245, and as used in
V:126\ib. Here $P$ for the 12 degrees of freedom is .0066; and the
conclusion to be drawn from so small a value of $P$ is that there is
a real discrepancy, not attributable to chance alone, between the
observed and graduated tables, i.e., that the fit of Makeham's
formula is, on this criterion, not satisfactory, since deviations as
large as or larger than those which have here produced 28.4186
for $\chi_0^2$ would occur from random sampling alone in only about
6 out of 1,000 trials.
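
The two components of $\chi_0^2$ in Table F, and the equivalence of their sum with the single-sum form having E'p''q'' in the denominator, can be verified from the figures of Table E. The sketch below is illustrative only; the variable names are not the author's.

```python
exposed = [1749, 15989, 107629, 346276, 588003, 728094, 757987,
           701051, 590761, 442842, 286647, 151977, 62595, 18059]
observed = [15, 113, 864, 3119, 6461, 9761, 13071,
            16521, 19628, 21428, 20731, 16160, 10003, 3946]
graduated = [12, 116, 861, 3136, 6300, 9698, 13183,
             16636, 19815, 21527, 20505, 16105, 9796, 4131]

# Deaths series: (observed - graduated)^2 / graduated deaths.
chi2_deaths = sum((o - g) ** 2 / g for o, g in zip(observed, graduated))

# "Do not die" series: the deviation is the same with opposite sign,
# but the divisor is the graduated number who do not die.
chi2_survivors = sum((o - g) ** 2 / (E - g)
                     for E, o, g in zip(exposed, observed, graduated))

chi2_total = chi2_deaths + chi2_survivors

# Equivalent single sum with E * p'' * q'' in the denominator.
chi2_single = sum((o - g) ** 2 / (E * (g / E) * (1 - g / E))
                  for E, o, g in zip(exposed, observed, graduated))

# 14 groups, no constraint for the assumed c, two for the fitted constants.
degrees_of_freedom = len(exposed) - (0 + 2)

print(round(chi2_deaths, 4), round(chi2_survivors, 4),
      round(chi2_total, 4), degrees_of_freedom)
```

The identity holds because 1/θ'' + 1/(E' − θ'') = E'/(θ''(E' − θ'')) = 1/(E'p''q'').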

This case offers an interesting example of the desirability of
proceeding beyond a routine application of the $\chi^2$ test in
examining the general goodness of fit of a curve to extensive
mortality data. The $\chi^2$ test here indicates a need for care and
further analysis, not necessarily rejection of the graduation. For it
must be remembered, as noted at (iii) on p. 113, that when the
data, as here, are very large, $P$ is often small even though the
fit appears to be good. In practice, therefore, an examination
should next be made of the general manner in which the
deviations between the actual and expected deaths are balanced.
Thus it will be seen from the following Table G that the features
of the data are well reproduced over broader age groups.

TABLE G 


Age Group     Actual Deaths,     Expected Deaths,     Deviation,
$x$           $\theta'_x$        $\theta''_x$         $\theta'_x - \theta''_x$

20-39            4,111               4,125               -14
40-59           45,814              45,817               - 3
60-79           77,947              77,952               - 5
80-89           13,949              13,927               +22


The smallness of the deviations over these broad ranges, and 
the satisfactory agreement which would also be found between 
the ungraduated and graduated annuity values, would amply 
justify the acceptance of the graduation in practice — particularly 
where the facilities of Makeham’s formula in the calculation of 
joint-life annuities might be important. 
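
The figures of Table G follow from Table E by simple addition over the broader age ranges, as the short sketch below confirms. The grouping by list position is an illustrative device, not part of the original computation.

```python
observed = [15, 113, 864, 3119, 6461, 9761, 13071,
            16521, 19628, 21428, 20731, 16160, 10003, 3946]
graduated = [12, 116, 861, 3136, 6300, 9698, 13183,
             16636, 19815, 21527, 20505, 16105, 9796, 4131]

# Positions of the 5-year groups of Table E falling in each broad group.
broad_groups = {
    "20-39": slice(0, 4),
    "40-59": slice(4, 8),
    "60-79": slice(8, 12),
    "80-89": slice(12, 14),
}

summary = {}
for label, rows in broad_groups.items():
    actual = sum(observed[rows])
    expected = sum(graduated[rows])
    summary[label] = (actual, expected, actual - expected)
    print(f"{label}: actual {actual}, expected {expected}, "
          f"deviation {actual - expected:+d}")
```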

In the illustration just given it was assumed that c was chosen 
arbitrarily, without any reference to the data; and under such 
circumstances it is quite clear that there is no constraint, and
consequently that no degree of freedom should be removed, in 
respect of that constant. If, however, c had been adjusted to 
the data by inspection, in order to give effect to a realization 
that a value derived from experiences with other material could 
not be assumed, the interpretation of the degrees of freedom is 
not so clear — for it cannot then be felt that there is no constraint 
at all with regard to c, although it is evident that one degree of 
freedom should not be deducted, and that the effect upon $\chi^2$ of
the adjustment by inspection cannot be measured. 


C;26. Illustrations of the Concept of “Confidence” or 
“Fiducial” Limits 

The charts of Clopper and Pearson (P:f^:410-411) provide
the confidence limits for values of $n$ up to 1,000 with great facility.
An example which they give for a small sample illustrates the
procedure clearly: Out of 30 individuals, selected at random
from a population, 8 are observed to die, so that the observed
proportion is $p' = \frac{8}{30} = .267$; within what limits may $p$ be
expected to lie? If we are satisfied to accept a risk of error of
not more than 1 in 20, or 5%, so that the prediction is to be based
on a confidence coefficient of .95, we read at once from the .95
chart, for $p' = .267$, that the lower curve thereon for $n = 30$ gives
$p = .12$, and the upper curve gives $p = .46$; that is to say, $p$,
which has been observed in this sample to be .267, may be
expected in the long run to lie between the limits .12 and .46.
If, on the other hand, we desire to have greater confidence in the
prediction, and should only be satisfied with a risk of error of
1 in 100, the chart for the .99 confidence coefficient would be
used, from which the limits are seen to be .09 and .52. By
similar reasoning the graphs can be employed to determine the
size of the sample necessary in order to attain a stated degree of
accuracy in estimation (see P:f^:411-413).
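
Limits of the kind read from the charts can also be approximated numerically. The sketch below is an illustration under stated assumptions rather than a reproduction of Clopper and Pearson's construction: it takes the lower limit as the value of p at which the chance of 8 or more deaths in 30 equals half the accepted risk of error, the upper limit as the value at which the chance of 8 or fewer equals the same amount, and finds both by bisection on exact binomial sums. The function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for a binomial variable with parameters n and p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_increasing(func, lo=0.0, hi=1.0, iterations=60):
    """Root of an increasing function on [lo, hi], found by repeated halving."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def binomial_limits(x, n, confidence):
    """Two-sided limits for p, with equal tail risks (assumed construction)."""
    tail = (1 - confidence) / 2
    if x == 0:
        lower = 0.0
    else:
        # P(X >= x | p) increases with p; find where it equals the tail risk.
        lower = bisect_increasing(lambda p: (1 - binom_cdf(x - 1, n, p)) - tail)
    if x == n:
        upper = 1.0
    else:
        # P(X <= x | p) decreases with p, so tail - cdf increases with p.
        upper = bisect_increasing(lambda p: tail - binom_cdf(x, n, p))
    return lower, upper


lower95, upper95 = binomial_limits(8, 30, 0.95)
lower99, upper99 = binomial_limits(8, 30, 0.99)
print(round(lower95, 2), round(upper95, 2))
print(round(lower99, 2), round(upper99, 2))
```

The results may be compared with the values .12 and .46, and .09 and .52, read from the charts in the text.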

When the smallness of $p'$ or of $q'$ (< about .03), and of $np'$
or $nq'$ (< about 10), warns of the doubtful applicability of the
normal distribution, and suggests instead the use of the Poisson
exponential, Ricker's table in P:f 05:354 may be used
conveniently for confidence coefficients of .95 or .99. For the
purposes of these confidence limits it may be noted that, in
comparison with the approximate condition just stated ($p' < .03$,
$np' < 10$) for the use of Poisson's distribution, Ricker considers
that the Poisson values should be used when $p'$ does not exceed
.01, and may even be preferable up to .05. The close agreement of
the Poisson and binomial results for this last value is shown by
the example of 5 occurrences in 100 trials, so that $p' = .05$ and
$np' = 5$; for if we desire a confidence of 99%, Ricker's Poisson
table gives $p$ lying between .01 and .14, whereas Clopper and
Pearson's binomial chart indicates limits of .02 and .13.

The use of the preceding charts and tables may be compared 
with the example at pp. 271-2; C; 6, for large values of n on the 
basis of the probability integral of the normal curve. 


C;27. Applications of Correlation Theory to Actuarial Prob- 
lems 

A completely worked numerical discussion of the correlation 
between the mean ages at maturity and the unexpired terms of 
endowment assurances is shown by Elderton in P:S;?:142-155, 
194, and 210-220. Correlation between stature of father and 
stature of son is noted in P:f77:199, 211, 238, and 246; cor- 
relation between births in a certain district and the proportion 
of male births per thousand of all births in England and Wales 
is shown at p. 213, loc. cit.; and the correlation between infantile 
mortality under 1 year of age and the general rate of mortality 
at all ages in England and Wales is examined at pp. 292-294 of 
the same volume. In P:4S:179-190 the calculations are given 
for correlation between the statures of fathers and daughters. 

Another interesting application is the construction (discussed
in P:S1) of the unknown values of, say, […] at a particular age
$a$ in the calendar year $z$ for community I, from the known
value for community II, by calculating the correlation
coefficient and the regression equation between the known series
[…] and the corresponding known series […] for earlier calendar
years.

The multiple correlation method has been applied by E. C. 
Snow (H:n5; see also P:f^7:61) in the calculation of post- 
censal population estimates, on the assumption that the increase 
of population between two censuses may be expressed as a linear 
function of two or three different variables such as (a) the in- 
crease of births during the period over those of the preceding 
intercensal period, and the similar increase in the deaths, and 
in the marriages, or (b) the natural increase (i.e., births less 
deaths), and the increase in the number of inhabited houses, or 
(c) the increase in the inhabited houses and the increase in 
rateable values. 